



AN INTRODUCTION TO THE 

THEORY OF STATISTICS



G. UDNY YULE,



With 53 Figures and Diagrams.



LONDON:

CHARLES GRIFFIN AND COMPANY, LIMITED.

PHILADELPHIA: J. B. LIPPINCOTT COMPANY.

1911. 






PREFACE.



The following chapters are based on the courses of instruction
given during my tenure of the Newmarch Lectureship in Statistics
at University College, London, in the sessions 1902-1909. The
variety of illustrations and examples has, however, been increased 
to render the book more suitable for the use of biologists and 
others besides those interested in economic and vital statistics, 
and some of the more difficult parts of the subject have been 
treated in greater detail than was possible in a sessional course 
of some thirty lectures. For the rest, the chapters follow closely 
the arrangement of the course, the three parts into which the 
volume is divided corresponding approximately to the work of 
the three terms. To enable the student to proceed further with 
the subject, fairly detailed lists of references to the original 
memoirs have been given at the end of each chapter : exercises 
have also been added for the benefit, more especially, of the
student who is working without the assistance of a teacher. 

The volume represents an attempt to work out a systematic 
introductory course on statistical methods — the methods available
for discussing, as distinct from collecting, statistical data — suited 
to those who possess only a limited knowledge of mathematics : 
an acquaintance with algebra up to the binomial theorem, 
together with such elements of co-ordinate geometry as are now 
generally included therewith, is all that is assumed. I hope that 
it may prove of some service to the students of the diverse 
sciences in which statistical methods are now employed. 

My most grateful thanks are due to Mr R. H. Hooker not only
for reading the greater part of the manuscript, and the proofs,
and for making many criticisms and suggestions which have
been of the greatest service, but also for much friendly help and
encouragement without which the preparation of the volume,
often delayed and interrupted by the pressure of other work, 
might never have been completed: my debt to Mr Hooker is 
indeed greater than can well be expressed in a formal preface. 
My thanks are also due to Mr H. D. Vigor for some assistance 
in checking the arithmetic, and my acknowledgments to Professor 
Edgeworth for the example used in § 5 of Chap. XVII. to illustrate
the influence of the form of the frequency distribution on the 
probable error of the median. 

I can hardly hope that all errors in the text or in the mass
of arithmetic involved in examples and exercises have been
eliminated, and will feel indebted to any reader who directs
my attention to any such mistakes, or to any omissions,
ambiguities, or obscurities.

G. U. Y.

December 1910.






CONTENTS. 



INTRODUCTION.

1-3. The introduction of the terms "statistics," "statistical," into
the English language — 4-6. The change in meaning of these
terms during the nineteenth century — 7-9. The present use
of the terms — 10. Definitions of "statistics," "statistical
methods," "theory of statistics," in accordance with present
usage


PART I.— THE THEORY OF ATTRIBUTES.

CHAPTER I.

NOTATION AND TERMINOLOGY.

1-2. Statistics of attributes and statistics of variables: fundamental
character of the former — 3-5. Classification by dichotomy —
6-7. Notation for single attributes and for combinations —
8. The class-frequency — 9. Positive and negative attributes,
contraries — 10. The order of a class — 11. The aggregate —
12. The arrangement of classes by order and aggregate —
13-14. Sufficiency of the tabulation of the ultimate
class-frequencies — 15-17. Or, better, of the positive
class-frequencies — 18. The class-frequencies chosen in the census
for tabulation of statistics of infirmities — 19. Inclusive and
exclusive notations and terminologies

CHAPTER II.

CONSISTENCE.

1-3. The field of observation or universe, and its specification by
symbols — 4. Derivation of complex from simple relations by
specifying the universe — 5-6. Consistence — 7-10. Conditions
of consistence for one and for two attributes —
11-14. Conditions of consistence for three attributes



CHAPTER III.

ASSOCIATION.

1 . . . . of independence — 6-10. The conception of
association, and testing for the same by the comparison
of percentages — 11-12. Numerical equality of the differences
between the four second-order frequencies and their
independence values — 13. The coefficient of association —
14. Necessity for an investigation into the causation of an
attribute A being extended to include non-A's

CHAPTER IV.

PARTIAL ASSOCIATION.

1-2. Uncertainty in interpretation of an observed association — 3-5.
Source of the ambiguity: partial associations — 6-8. Illusory
association due to the association of each of two attributes
with a third — 9. Estimation of the partial associations from
the frequencies of the second order — 10-12. The total
number of associations for a given number of attributes —
13-14. The case of complete independence

CHAPTER V.

MANIFOLD CLASSIFICATION.

1. The general principle of a manifold classification — 2-4. The
table of double entry or contingency table and its treatment
by fundamental methods — 5-8. The coefficient of contingency —
9-10. Analysis of a contingency table by tetrads
— 11-13. Isotropic and anisotropic distributions — 14-15.
Homogeneity of the classifications dealt with in the preceding
chapters: heterogeneous classifications



PART II.— THE THEORY OF VARIABLES. 

CHAPTER VI.

THE FREQUENCY-DISTRIBUTION.

1. Introductory — 2. Necessity for classification of observations: the
frequency-distribution — 3. Illustrations — 4. Method of forming
the table — 5. Magnitude of class-intervals — 6. Position
of intervals — 7. Process of classification — 8. Treatment of
intermediate observations — 9. Tabulation — 10. Tables with
. . . . 13. The symmetrical distribution — 14. The moderately
asymmetrical distribution — 15. The extremely asymmetrical
or J-shaped distribution — 16. The U-shaped distribution



CHAPTER VII.

AVERAGES.

1. Necessity for quantitative definition of the char . . . .
(averages) . . . . average . . . . —
3-13. The arithmetic mean: its definition, calculation, and
simpler properties — 14-18. The median: its definition,
calculation, and simpler properties — 19-20. The mode: its
definition and relation to mean and median — 21. Summary
comparison of the preceding forms of average — 22-26. The
geometric mean: its definition, simpler properties . . . .
: its definition and calculation



CHAPTER VIII.

MEASURES OF DISPERSION, ETC.

1. Inadequacy of the range as a measure of dispersion —
2-13. The standard deviation: its definition, calculation,
and properties — 14-19. The mean deviation: its definition,
calculation, and properties — 20-24. The quartile deviation
or semi-interquartile range — 25. Measures of relative
dispersion — 26. Measures of asymmetry or skewness — 27-30.
The method of grades or percentiles

CHAPTER IX.

CORRELATION.

1-3. The correlation table and its formation — 4-5. The correlation
surface — 6-7. The general problem — 8-9. The line of means
of rows and the line of means of columns: their relative
positions in the case of independence and of varying degrees
of correlation — 10-14. The correlation-coefficient and the
regressions — 15-16. Numerical calculations — 17. Certain
points to be remembered in calculating and using the
coefficient

CHAPTER X.

CORRELATION: PRACTICAL APPLICATIONS
AND METHODS.

. . . . 9-10. Illustration ii.: Inheritance of fertility — 11-13.
Illustration iii.: The weather and the crops — 14. Correlation
between the movements of two variables: (a)
Non-periodic movements: Illustration iv.: changes in
infantile and general mortality — 15-17. (b) Quasi-periodic
movements: Illustration v.: the marriage-rate and foreign
trade — 18. Elementary methods of dealing with cases of
non-linear regression — 19. Certain rough methods of
approximating to the correlation-coefficient



CHAPTER XI.

MISCELLANEOUS THEOREMS INVOLVING THE USE OF
THE CORRELATION-COEFFICIENT.

1. Introductory — 2. Standard-deviation of a sum or difference —
3. Influence of grouping of observations on the
standard-deviation — 4-5. Influence of errors of observation on the
standard-deviation — 6-7. Influence of errors of observation
on the correlation-coefficient (Spearman's theorems) — 8.
Mean and standard-deviation of an index — 9. Correlation
between indices — 10. Correlation-coefficient for a two × two-fold
table — 11. Correlation-coefficient for all possible pairs of
N values of a variable — 12. Correlation due to heterogeneity
of material — 13. Reduction of correlation due to mingling
of uncorrelated with correlated material — 14-17. The
weighted mean — 18-19. Application of weighting to the
correction of death-rates, etc., for varying sex and
age-distributions — 20. The weighting of forms of average other
than the arithmetic mean



CHAPTER XII.

PARTIAL CORRELATION.

1-2. Introductory explanation — 3. Direct deduction of the formulae
for two variables — 4. Special notation for the general
case: generalised regressions — 5. Generalised correlations —
6. Generalised deviations and standard-deviations —
7-8. Theorems concerning the generalised product-sums —
9. Direct interpretation of the generalised regressions —
10-11. Reduction of the generalised standard-deviation —
12. Reduction of the generalised regression — 13. Reduction
of the generalised correlation-coefficient — 14. Arithmetical
work: Example i.; Example ii. — 15. Geometrical
representation of correlation between three variables by means of
a model — 16. The coefficient of n-fold correlation — 17.
Expression of regressions and correlations of lower in terms of
those of higher order — 18. Limiting inequalities between
the values of correlation-coefficients necessary for
consistence — 19. Fallacies



PART III.— THEORY OF SAMPLING.

CHAPTER XIII.

SIMPLE SAMPLING OF ATTRIBUTES.

1. The problem of the present Part — 2. The two chief divisions of
the theory of sampling — 3. Limitation of the discussion to
the case of simple sampling — 4. Definition of the chance of
success or failure of a given event — 5. Determination of the
mean and standard-deviation of the number of successes in
n events — 6. The same for the proportion of successes in n
events: the standard-deviation of simple sampling as a
measure of unreliability or its reciprocal as a measure of
precision — 7. Verification of the theoretical results by
experiment — 8. More detailed discussion of the assumptions
on which the formula for the standard-deviation of simple
sampling is based — 9-10. Biological cases to which the
theory is directly applicable — 11. Standard-deviation of
simple sampling when the numbers of observations in the
samples vary — 12. Approximate value of the
standard-deviation of simple sampling, and relation between mean
and standard-deviation, when the chance of success or
failure is very small — 13. Use of the standard-deviation of
simple sampling, or standard error, for checking and
controlling the interpretation of statistical results



CHAPTER XIV.

SIMPLE SAMPLING CONTINUED: EFFECT OF REMOVING
THE LIMITATIONS OF SIMPLE SAMPLING.

1. Warning as to the assumption that three times the standard
error gives the range for the majority of fluctuations of
simple sampling of either sign — 2. Warning as to the use
of the observed for the true value of p in the formula for
the standard error — 3. The inverse standard error, or
standard error of the true proportion for a given observed
proportion: equivalence of the direct and inverse standard
errors when n is large — 4-8. The importance of errors
other than fluctuations of "simple sampling" in practice:
unrepresentative or biassed samples — 9-10. Effect of
divergences from the conditions of simple sampling: (a) effect
of variation in p and q for the several universes from which
the samples are drawn — 11-12. (b) Effect of variation in
p and q from one sub-class to another within each universe —
13-14. (c) Effect of a correlation between the results of the
several events — 15. Summary



CHAPTER XV.

THE BINOMIAL DISTRIBUTION AND THE
NORMAL CURVE.

1-2. Determination of the frequency-distribution for the number
of successes in n events: the binomial distribution — 3.
Dependence of the form of the distribution on p, q, and n —
4-5. Graphical and mechanical methods of forming
representations of the binomial distribution — 6. Direct
calculation of the mean and the standard-deviation from
the distribution — 7-8. Necessity of deducing, for use in
many practical cases, a continuous curve giving
approximately, for large values of n, the terms of the binomial
series — 9. Deduction of the normal curve as a limit to the
symmetrical binomial — 10-11. The value of the central
ordinate — 12. Comparison with a binomial distribution for
a moderate value of n — 13. Outline of the more general
conditions from which the curve can be deduced by advanced
methods — 14. Fitting the curve to an actual series of
observations — 15. Difficulty of a complete test of fit by
elementary methods — 16. The table of areas of the normal
curve and its use — 17. The quartile deviation and the
"probable error" — 18. Illustrations of the application of
the normal curve and of the table of areas



CHAPTER XVI.

NORMAL CORRELATION.

1-3. Deduction of the general expression for the normal correlation
surface from the case of independence — 4. Constancy of the
standard-deviations of parallel arrays and linearity of the
regression — 5. The contour lines: a series of concentric and
similar ellipses — 6. The normal surface for two correlated
variables regarded as a normal surface for uncorrelated
variables rotated with respect to the axes of measurement:
arrays taken at any angle across the surface are normal
distributions with constant standard-deviation: distribution
of and correlation between linear functions of two normally
correlated variables are normal: principal axes — 7.
Standard-deviations round the principal axes — 8-11. Investigation of
Table III., Chapter IX., to test normality: linearity of
regression, constancy of standard-deviation of arrays,
normality of distribution obtained by diagonal addition,
contour lines — 12-13. Isotropy of the normal distribution
for two variables — 14. Outline of the principal properties of
the normal distribution for n variables



CHAPTER XVII.

THE SIMPLER CASES OF SAMPLING FOR VARIABLES:
PERCENTILES AND MEAN.

1-2. The problem of sampling for variables: the conditions
assumed — 3. Standard error of a percentile — 4. Special
values for the percentiles of a normal distribution — 5.
Effect of the form of the distribution generally — 6. Simplified
formula for the case of a grouped frequency-distribution — 7.
Correlation between errors in two percentiles of the same
distribution — 8. Standard error of the interquartile range
for the normal curve — 9. Effect of re . . . .
of simple sampling, and limitations . . . . — 10.
Standard error of the arithmetic mean — 11. Relative
stability of mean and median in sampling — 12. Standard error
of the difference between two means — 13. The tendency to
normality of a distribution of means — 14. Effect of removing
the restrictions of simple sampling — 15. Statement of the
standard errors of standard-deviation, coefficient of variation,
correlation-coefficient, and regression — 16. Restatement of
the limitations of interpretation if the sample be small



APPENDIX I.— Tables for facilitating Statistical Work

. . . .

INDEX






THEORY OF STATISTICS.



INTRODUCTION.

1-3. The introduction of the terms "statistics," "statistical," into the English
language — 4-6. The change in meaning of these terms during the
nineteenth century — 7-9. The present use of the terms — 10. Definitions
of "statistics," "statistical methods," "theory of statistics," in
accordance with present usage.

1. The words "statist," "statistics," "statistical," appear to be
all derived, more or less indirectly, from the Latin status, in the
sense that it acquired in mediæval Latin of a political state.

2. The first term is, however, of much earlier date than the two
others. The word "statist" occurs, for instance, in Hamlet
(1602),¹ Cymbeline (1610 or 1611),² and in Paradise Regained
(1671).³ "Statistics" and "statistical" seem to have been only
introduced into English in 1787, the earliest known use of the
terms occurring in the preface to A Political Survey of the Present
State of Europe, by E. A. W. Zimmermann,⁴ issued in that year.
"It is about forty years ago," says Zimmermann, "that that branch
of political knowledge, which has for its object the actual and
relative power of the several modern states, the power arising
from their natural advantages, the industry and civilisation of
their inhabitants, and the wisdom of their governments, has been
formed, chiefly by German writers, into a separate science. . . .
By the more convenient form it has now received . . . . this
science, distinguished by the new-coined name of statistics, is
become a favourite study in Germany" (p. ii); and again (p. v),
"To the several articles contained in this work, some respectable

¹ Act v., sc. 2. ² Act ii., sc. 4. ³ Bk. iv.

⁴ Zimmermann's work appears to have been written in English, though he
was a German, Professor of Natural Philosophy at Brunswick.

statistical writers have added a view of the principal epochas of the
history of each country." 

3. Within the next few years the words were adopted by several
writers, notably by Sir John Sinclair, the editor and organiser of the
first Statistical Account of Scotland,¹ to whom, indeed, their
introduction has been frequently ascribed. In the circular letter to the
Clergy of the Church of Scotland issued in May 1790,² he states
that in Germany "'Statistical Inquiries,' as they are called, have
been carried to a very great extent," and adds an explanatory
footnote to the phrase "Statistical Inquiries" — "or inquiries
respecting the population, the political circumstances, the
productions of a country, and other matters of state." In the
"History of the Origin and Progress"³ of the work, he tells us,
"Many people were at first surprised at my using the new words,
Statistics and Statistical, as it was supposed that some term in our
own language might have expressed the same meaning. But in
the course of a very extensive tour, through the northern parts of
Europe, which I happened to take in 1786, I found that in
Germany they were engaged in a species of political enquiry,
to which they had given the name of Statistics;⁴ . . . . as I
thought that a new word might attract more public attention,
I resolved on adopting it, and I hope that it is now completely
naturalised and incorporated with our language." This hope
was certainly justified, but the meaning of the word underwent
rapid development during the half century or so following its
introduction.

4. "Statistics" (statistik), as the term is used by German 
writers of the eighteenth century, by Zimmermann and by Sir 
John Sinclair, meant simply the exposition of the noteworthy 
characteristics of a state, the mode of exposition being — almost
inevitably at that time — preponderantly verbal. The conciseness
and definite character of numerical data were recognised at a
comparatively early period — more particularly by English writers
— but trustworthy figures were scarce. After the commencement
of the nineteenth century, however, the growth of official data
was continuous, and numerical statements, accordingly, began
more and more to displace the verbal descriptions of earlier days.
"Statistics" thus insensibly acquired a narrower signification, viz.,

¹ Twenty-one vols., 1791-99.

² Statistical Account . . . . Progress . . . ."

³ Loc. cit., p. xiii.

⁴ The Abriss der Staatswissenschaft der Europäischen Reiche (1749) of Gottfried
Achenwall, Professor of Politics at Göttingen, is the volume in which the word . . . .


the exposition of the characteristics of a State by numerical
methods. It is difficult to say at what epoch the word came
definitely to bear this quantitative meaning, but the transition
appears to have been only half accomplished even after the foundation
of the Royal Statistical Society in 1834. The articles in the
first volume of the Journal, issued in 1838-9, are for the most
part of a numerical character, but the official definition has no
reference to method. "Statistics," we read, "may be said, in the
words of the prospectus of this Society, to be the ascertaining
and bringing together of those facts which are calculated to
illustrate the condition and prospects of society."¹ It is, however,
admitted that " the statist commonly prefers to employ figures 
and tabular exhibitions." 

5. Once, however, the first change of meaning was accomplished,
further changes followed. From the name of a science or art of
state-description by numerical methods, the word was transferred to
those series of figures with which it operated, as we speak of vital
statistics, poor-law statistics, and so forth. But similar data
occur in many connections; in meteorology, for instance, in
anthropology, etc. Such collections of numerical data were also termed
"statistics," and consequently, at the present day, the word is
held to cover a collection of numerical data, analogous to those
which were originally formed for the study of the state, on almost
any subject whatever. We not only read of rainfall "statistics,"
but of "statistics" showing the growth of an organisation for
recording rainfall.² We find a chapter headed "Statistics" in a
book on psychology,³ and the author writing of "statistics
concerning the mental characteristics of man," "statistics of children,
under the headings bright — average — dull."⁴ We are informed
that, in a book on Latin verse, the characteristics of the Virgilian
hexameter "are examined carefully with statistics."⁵

6. The development in meaning of the adjective " statistical " 
was naturally similar. The methods applied to the study of 
numerical data concerning the state were still termed "statistical 
methods," even when applied to data from other sources. Thus 
we read of the inheritance of genius being treat«d "in a statistical 
manner,"* and we have now "a journal for the statistical 
study of biological problems."^ Such phrases as "the statistical 



' E W. Scripture, The New Psychology, 1897, ohap. ii. 

* (>p. dt p. 18. 

' AtheiKBum, Oct 3, 1903. 

* Frauois Oalton, Heredilary Oenius (Macmillan, ISdB), prefoce. 

' Bicmtelrilca, Cambridge Univ. Prasa, the fiist number issQwi in IBOl. 



4 THEORY OF STATISTICS. 

investigatioQ of the motion of molecules " ^ have become part of 
the ordinary language of physiciBte. We find a work entitled 
"the principles of statistical mechanics,"* and the Bakerian 

lecture for 1909, by Sir J. Larmor, was on "the statistical and 
tbermodynamical relations of radiant energy," 

7. It is unnecessary to multiply such instances to show that the 
words "statistics," "statistical," no longer bear any necessary 
reference to "matters of state." They are applied indifferently in
physics, biology, anthropology, and meteorology, as well as in the
social sciences. Diverse though these cases are, there must be
some community of character between them, or the same terms
and the same methods would not be applied. What, then, is this 
common character ? 

8. Let us turn to social science, as the parent of the methods 
termed " statistical," for a moment, and consider its characteristics 
as compared, say, with physics or chemistry. One characteristic
stands out so markedly that attention has been repeatedly
directed to it by "statistical" writers as the source of the peculiar
difficulties of their science: the observer of social facts cannot
experiment, but must deal with circumstances as they occur, apart
from his control. Now the object of experiment is to replace the
complex systems of causation usually occurring in nature by
simple systems in which only one causal circumstance is permitted
to vary at a time. This simplification being impossible, the
observer has, in general, to deal with highly complicated cases of 
multiple causation — cases in which a given result may be due to 
any one of a number of alternative causes or to a number of 
different causes acting conjointly. 

9. A little consideration will show, however, that this is also 
precisely the characteristic of the observations in other fields to 
which statistical methods are applied. The meteorologist, for 
example, is in almost precisely the same position as the student 
of social science. He can experiment on minor points, but the 
records of the barometer, thermometer, and rain gauge have to be 
treated as they stand. With the biologist, matters are in some- 
what better case. He can and does apply experimental methods 
to a very large extent, but frequently cannot approximate
closely to the experimental ideal ; the internal circumstances of 
animals and plants too easily evade complete control. Hence a 
large field (notably the study of variation and heredity) is left, 
in which statistical methods have either to aid or to replace the 
methods of experiment. The physicist and chemist, finally, 

' Cletk Maxwell, "Theory of Heat" (1871), aiid "On BoltnnsDu's 
Thsorem"{lS78), Omb. Phil. Trans., vol. jii ., , 

» By J. Wil!anlGibbB{M»cniill»n, 1903X ..i- L.t)0'^le 



INTEODDCTTON. 5 

stand at the other extremity of the scale. Theirs are the 
sciences in which experiment has been brought to its greatest 
perfection. But even so, statistical methods atill find appLcation. 
In the first place, the methods available for eliminating the effect 
of disturbing circumstances, though continually improved, are not, 
and cannot be, absolutely perfect. The observer himself, as well 
as the observing instrument, is a source of error ; the effects of 
changes of temperature, or of moisture, of pressure, draughts, vibration,
cannot be completely eliminated. Further, in the problems
of molecular physics, referred to in the last sentences of § 6, 
multiplicity of causes is of the essence of the case. The motion 
of an atom or of a molecule in the middle of a swarm is dependent 
on that of every other atom or molecule in the swarm. 

10. In the light of this discussion, we may accordingly give the 
following definitions : — 

By statistics we mean quantitative data affected to a marked
extent by a multiplicity of causes.

By statistical methods we mean methods specially adapted to
the elucidation of quantitative data affected by a multiplicity of
causes.

By theory of statistics we mean the exposition of statistical
methods.

The insertion in the first definition of some such words as "to
a marked extent" is necessary, since the term "statistics" is not
usually applied to data, like those of the physicist, which are
affected only by a relatively small residuum of disturbing causes.
At the same time, "statistical methods" are applicable to all such
cases, whether the influence of many causes be large or not.






PART I. THE THEORY OF ATTRIBUTES.

CHAPTER I.

NOTATION AND TERMINOLOGY.

1-2. Statistics of attributes and statistics of variables: fundamental character
of the former. 3-5. Classification by dichotomy. 6-7. Notation for
single attributes and for combinations. 8. The class-frequency. 9.
Positive and negative attributes, contraries. 10. The order of a class.
11. The aggregate. 12. The arrangement of classes by order and
aggregate. 13-14. Sufficiency of the tabulation of the ultimate
class-frequencies. 15-17. Or, better, of the positive class-frequencies. 18.
The class-frequencies chosen in the census for tabulation of statistics
of infirmities. 19. Inclusive and exclusive notations and terminologies.

1. The methods of statistics, as defined in the Introduction,
deal with quantitative data alone. The quantitative character
may, however, arise in two different ways.

In the first place, the observer may note only the presence or
absence of some attribute in a series of objects or individuals, and
count how many do or do not possess it. Thus, in a given
population, we may count the number of the blind and seeing,
the dumb and speaking, or the insane and sane. The quantitative
character, in such cases, arises solely in the counting.

In the second place, the observer may note or measure the
actual magnitude of some variable character for each of the 
objects or individuals observed. He may record, for instance, the 
ages of persons at death, the prices of different samples of a 
commodity, the statures of men, the numbers of petals in flowers. 
The observations in these cases are quantitative ab initio. 

2. The methods applicable to the former kind of observations,
which may be termed statistics of attributes, are also applicable
to the latter, or statistics of variables. A record of statures of
men, for example, may be treated by simply counting all measurements
as tall that exceed a certain limit, neglecting the magnitude
of excess or defect, and stating the numbers of tall and short (or
more strictly not-tall) on the basis of this classification. Similarly,
the methods that are specially adapted to the treatment of
statistics of variables, making use of each value recorded, are
available to a greater extent than might at first sight seem possible
for dealing with statistics of attributes. For example, we may
treat the presence or absence of the attribute as corresponding to
the changes of a variable which can only possess two values, say
0 and 1. Or, we may assume that we have really to do with a
variable character which has been crudely classified, as suggested
above, and we may be able, by auxiliary hypotheses as to the
nature of this variable, to draw further conclusions. But the
methods and principles developed for the case in which the observer
only notes the presence or absence of attributes are the simplest
and most fundamental, and are best considered first. This and
the next three chapters (Chapters I.-IV.) are accordingly devoted
to the Theory of Attributes.
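
The two reductions described in this section can be sketched in a few lines of Python. The statures, the limit of 70 inches, and the function names are hypothetical, chosen only for illustration; they are not figures from the text.

```python
# Hypothetical statures in inches; not data from the text.
statures = [64.5, 71.0, 68.2, 73.4, 66.9, 70.1, 69.8, 72.3]
LIMIT = 70.0  # arbitrary boundary, an assumption of this sketch


def dichotomise(values, limit):
    """Return (number tall, number not-tall), ignoring the size of excess or defect."""
    tall = sum(1 for v in values if v > limit)
    return tall, len(values) - tall


def as_indicator(values, limit):
    """Code the attribute 'tall' as a variable taking only the values 0 and 1."""
    return [1 if v > limit else 0 for v in values]


tall, not_tall = dichotomise(statures, LIMIT)
indicator = as_indicator(statures, LIMIT)

# The mean of the 0/1 variable is the proportion possessing the attribute.
proportion_tall = sum(indicator) / len(indicator)
```

The first function treats a record of a variable by the methods of attributes; the second goes the other way and makes an attribute available to methods that use each recorded value.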

3. The objects or individuals that possess the attribute, and
those that do not possess it, may be said to be members of two
distinct classes, the observer classifying the objects or individuals
observed. In the simplest case, where attention is paid to one
attribute alone, only two mutually exclusive classes are formed.
If several attributes are noted, the process of classification may,
however, be continued indefinitely. Those that do and do not
possess the first attribute may be reclassified according as they do
or do not possess the second, the members of each of the sub-classes
so formed according as they do or do not possess the
third, and so on, every class being divided into two at each step.
Thus the members of the population of any district may be
classified into males and females; the members of each sex into
sane and insane; the insane males, sane males, insane females,
and sane females into blind and seeing. If we were dealing with
a number of peas (Pisum sativum) of different varieties, they
might be classified as tall or dwarf, with green seeds or yellow
seeds, with wrinkled seeds or round seeds, so that we would have
eight classes: tall with round green seeds, tall with round yellow
seeds, tall with wrinkled green seeds, tall with wrinkled yellow
seeds, and four similar classes of dwarf plants.
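
As an illustration of how repeated division in two multiplies the classes, the following Python sketch enumerates the eight classes of peas named above. The list layout and variable names are assumptions of the sketch.

```python
from itertools import product

# Each attribute is recorded as one of two alternatives.
attributes = [
    ("tall", "dwarf"),
    ("round seeds", "wrinkled seeds"),
    ("green seeds", "yellow seeds"),
]

# Dividing every class in two at each step yields 2 * 2 * 2 classes.
classes = [", ".join(choice) for choice in product(*attributes)]

for name in classes:
    print(name)
```

Adding a fourth pair of alternatives to the list would double the number of classes again.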

4. It may be noticed that the fact of classification does not 
necessarily imply the existence of either a natural or a clearly 
defined boundary between the two classes. The boundary may 
be wholly arbitrary, e.g. where prices are classified as above or 
below some special value, barometer readings as above or below 
some particular height. The division may also be vague and 
uncertain : sanity and insanity, sight and blindness, pass 
into each other by such fine gradations that judgments may
differ as to the class in which a given individual should be
entered. The possibility of uncertainties of this kind should
always be borne in mind in considering statistics of attributes:
whatever the nature of the classification, however, natural or
artificial, definite or uncertain, the final judgment must be decisive;
any one object or individual must be held either to possess
the given attribute or not.

5. A classification of the simple kind considered, in which each 
class is divided into two sub-classes and no more, has been termed 
by logicians classification, or, to use the more strictly applicable
term, division by dichotomy (cutting in two). The classifications
of most statistics are not dichotomous, for most usually a
class is divided into more than two sub-classes, but dichotomy is
the fundamental case. In Chapter V. the relation of dichotomy
to more elaborate (manifold, instead of twofold or dichotomous)
processes of classification, and the methods applicable to some
such cases, are dealt with briefly.

6. For theoretical purposes it is necessary to have some simple
notation for the classes formed, and for the numbers of observations
assigned to each.

The capitals A, B, C, . . . will be used to denote the several
attributes. An object or individual possessing the attribute A
will be termed simply A. The class, all the members of which
possess the attribute A, will be termed the class A. It is convenient
to use single symbols also to denote the absence of the
attributes A, B, C, . . . We shall employ the Greek letters α,
β, γ, . . . Thus if A represents the attribute blindness, α
represents sight, i.e. non-blindness; if B stands for deafness, β
stands for hearing. Generally "α" is equivalent to "non-A," or
an object or individual not possessing the attribute A; the class α
is equivalent to the class none of the members of which possess the
attribute A.

7. Combinations of attributes will be represented by juxtapositions
of letters. Thus if, as above, A represents blindness, B
deafness, AB represents the combination blindness and deafness.
If the presence and absence of these attributes be noted, the four
classes so formed, viz. AB, Aβ, αB, αβ, include respectively the
blind and deaf, the blind but not-deaf, the deaf but not-blind, and
the neither blind nor deaf. If a third attribute be noted, e.g. insanity,
denoted say by C, the class ABC includes those who are
at once deaf, blind, and insane, ABγ those who are deaf and blind
but not insane, and so on.

Any letter or combination of letters like A, AB, αB, ABγ, by
means of which we specify the characters of the members of a class,
may be termed a class symbol.

8. The number of observations assigned to any class is termed,
for brevity, the frequency of the class, or the class-frequency.
Class-frequencies will be denoted by enclosing the corresponding
class-symbols in brackets. Thus:

(A)    denotes the number of A's,   i.e. objects possessing attribute A
(α)    denotes the number of α's,   i.e. objects not possessing attribute A
(AB)   denotes the number of AB's,  i.e. objects possessing attributes A and B
(αB)   denotes the number of αB's,  i.e. objects possessing B but not A
(ABC)  denotes the number of ABC's, i.e. objects possessing A, B, and C
(αBC)  denotes the number of αBC's, i.e. objects possessing B and C but not A
(αβC)  denotes the number of αβC's, i.e. objects possessing C but neither A nor B

and so on for any number of attributes. If A represent, as in
the illustration above, blindness, B deafness, C insanity, the
symbols given stand for the numbers of the blind, the not-blind,
the blind and deaf, the deaf but not blind, the blind, deaf, and insane,
the deaf and insane but not blind, and the insane but neither
blind nor deaf, respectively.
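
A class-frequency in this notation is simply a count, and may be sketched as follows in Python. The ten records are hypothetical, and the use of lower-case letters a, b, c in place of the Greek letters is a convention of this sketch only, not of the text.

```python
# Each record lists the attributes one individual possesses (hypothetical data).
# A = blindness, B = deafness, C = insanity, as in the illustration above.
records = [
    {"A"}, {"A", "B"}, {"B"}, set(), {"C"},
    {"A", "B", "C"}, {"B", "C"}, set(), {"A"}, set(),
]


def class_frequency(symbol, records):
    """Count the records belonging to the class named by `symbol`.

    A capital letter requires the attribute to be present; a lower-case
    letter stands in for the Greek letter and requires it to be absent.
    Attributes not named in the symbol are ignored.
    """
    count = 0
    for possessed in records:
        if all((ch.upper() in possessed) == ch.isupper() for ch in symbol):
            count += 1
    return count


for symbol in ["A", "a", "AB", "aB", "ABC", "aBC", "abC"]:
    print(f"({symbol}) = {class_frequency(symbol, records)}")
```

Because attributes not named in the symbol are ignored, (A) counts every blind individual, whether or not deaf or insane as well.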

9. The attributes denoted by capitals A, B, C, . . . may be
termed positive attributes, and their contraries denoted by Greek
letters negative attributes. If a class-symbol include only
capital letters, the class may be termed a positive class; if only
Greek letters, a negative class. Thus the classes A, AB, ABC
are positive classes; the classes α, αβ, αβγ, negative classes.

If two classes are such that every attribute in the symbol for
the one is the negative or contrary of the corresponding attribute
in the symbol for the other, they may be termed contrary classes
and their frequencies contrary frequencies; e.g. AB and αβ, Aβ
and αB, AβC and αBγ, are pairs of contraries.

10. The classes obtained by noting say n attributes fall into 
natural groups according to the numbers of attributes used to 
specify the respective classes, and these natural groups should be 
borne in mind in tabulating the class-frequencies. A class
specified by r attributes may be spoken of as a class of the rth
order and its frequency as a frequency of the rth order. Thus AB,
AC, BC are classes of the second order; (A), (Aβ), (αBC),
(ABγD), class-frequencies of the first, second, third, and fourth
orders respectively.

11. The classes of one and the same order fall into further 
groups according to the actual attributes specified. Thus if three 
attributes A, B, C have been noted, the classes of the second order
may be specified by any one of the pairs of attributes AB, AC, or
BC (and their contraries). The series of classes or class-frequencies
the symbols for which are derived from any one positive
class by substituting Greek letters for one or more of the italic
capital letters in every possible way will be termed an aggregate.
Thus (AB) (Aβ) (αB) (αβ) form an aggregate of frequencies of
the second order, and the twelve classes of the second order which
can be formed where three attributes have been noted may be
grouped into three such aggregates.

12. Class-frequencies should, in tabulating, be arranged so that
frequencies of the same order and frequencies belonging to the
same aggregate are kept together. Thus the frequencies for the
case of three attributes should be grouped as given below; the
whole number of observations denoted by the letter N being
reckoned as a frequency of order zero, since no attributes are
specified:

Order 0.   N
Order 1.   (A)     (B)     (C)
           (α)     (β)     (γ)
Order 2.   (AB)    (AC)    (BC)
           (Aβ)    (Aγ)    (Bγ)
           (αB)    (αC)    (βC)
           (αβ)    (αγ)    (βγ)
Order 3.   (ABC)   (αBC)
           (ABγ)   (αBγ)
           (AβC)   (αβC)
           (Aβγ)   (αβγ)                              (1)

13. In such a complete table for the case of three attributes,
twenty-seven distinct frequencies are given: 1 of order zero, 6
of the first order, 12 of the second, and 8 of the third. It
is, however, in no case necessary to give such a complete
statement.
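
The count of twenty-seven can be checked by listing the class-symbols order by order. The Python sketch below does this for any set of attributes; lower-case letters stand in for the Greek letters, and the empty symbol stands for N, both being conventions assumed for the sketch.

```python
from itertools import combinations, product


def classes_by_order(letters):
    """Map each order r to the list of class-symbols specified by r attributes."""
    table = {}
    for r in range(len(letters) + 1):
        symbols = []
        # Each choice of r attributes gives one aggregate of 2**r classes.
        for chosen in combinations(letters, r):
            for signs in product((True, False), repeat=r):
                symbols.append("".join(
                    ch if present else ch.lower()
                    for ch, present in zip(chosen, signs)))
        table[r] = symbols
    return table


table = classes_by_order("ABC")
counts = {r: len(symbols) for r, symbols in table.items()}
total = sum(counts.values())
```

For three attributes the first four symbols of order 2 produced by this sketch are AB, Ab, aB, ab, that is, the aggregate (AB) (Aβ) (αB) (αβ).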

The whole number of observations must clearly be equal to the
number of A's together with the number of α's, the number of
A's to the number of A's that are B together with the number of
A's that are not B; and so on, i.e. any class-frequency can always
be expressed in terms of class-frequencies of higher order. Thus:

$$
\begin{aligned}
N    &= (A) + (\alpha) = (B) + (\beta) = \text{etc.} \\
     &= (AB) + (A\beta) + (\alpha B) + (\alpha\beta) = \text{etc.} \\
(A)  &= (AB) + (A\beta) = (AC) + (A\gamma) = \text{etc.} \\
(AB) &= (ABC) + (AB\gamma) = \text{etc.}
\end{aligned}
\tag{2}
$$

Hence, instead of enumerating all the frequencies as under (1), 
no more need be given, for the case of three attributes, than 
the eight frequencies of the third order. If four attributes had 
been noted it would be sufficient to give the sixteen frequencies of 
the fourth order. 

The classes specified by all the attributes noted in any case, 
i.e. classes of the nth order in the case of n attributes, may be
termed the ultimate classes and their frequencies the ultimate
frequencies. Hence we may say that it is never necessary to
enumerate more than the ultimate frequencies. All the others can
be obtained from these by simple addition.

Example i. (See reference 5 at the end of the chapter.)
A number of school children were examined for the presence
or absence of certain defects of which three chief descriptions
were noted, A development defects, B nerve signs, C low
nutrition.

Given the following ultimate frequencies, find the frequencies
of the positive classes, including the whole number of observations
N.

(ABC)      57        (αBC)      78
(ABγ)     281        (αBγ)     670
(AβC)      86        (αβC)      65
(Aβγ)     453        (αβγ)   8,310

The whole number of observations N is equal to the grand
total: N = 10,000.

The frequency of any first-order class, e.g. (A), is given by the
total of the four third-order frequencies, the class-symbols for
which contain the same letter:

$$ (ABC) + (AB\gamma) + (A\beta C) + (A\beta\gamma) = (A) = 877. $$

Similarly, the frequency of any second-order class, e.g. (AB), is
given by the total of the two third-order frequencies, the class-symbols
for which both contain the same pair of letters:

$$ (ABC) + (AB\gamma) = (AB) = 338. $$

The complete results are:



N       10,000        (AB)     338
(A)        877        (AC)     143
(B)      1,086        (BC)     135
(C)        286        (ABC)     57
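
The additions of Example i. can be carried out mechanically. In the Python sketch below the eight ultimate frequencies are those given in the example; the dictionary layout, the lower-case letters standing for the Greek letters, and the function name are assumptions of the sketch.

```python
from itertools import combinations

# Ultimate frequencies of Example i.; lower-case letters stand for the
# Greek letters (absence of the attribute).
ultimate = {
    "ABC": 57,  "aBC": 78,
    "ABc": 281, "aBc": 670,
    "AbC": 86,  "abC": 65,
    "Abc": 453, "abc": 8310,
}


def positive_frequencies(ultimate, letters="ABC"):
    """Add up the ultimate frequencies whose symbols contain every chosen capital."""
    result = {}
    for r in range(len(letters) + 1):
        for chosen in combinations(letters, r):
            key = "".join(chosen) or "N"  # no attribute specified gives N
            result[key] = sum(
                freq for symbol, freq in ultimate.items()
                if all(ch in symbol for ch in chosen))
    return result


positive = positive_frequencies(ultimate)
for key, value in positive.items():
    print(f"({key}) = {value}")
```

Each positive frequency is obtained by simple addition alone, as the text states.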

14. The number of ultimate frequencies in the general case of 
n attributes, or the number of classes in an aggregate of the nth
order, is given by considering that each letter of the class-symbol
may be written in two ways (A or α, B or β, C or γ), and that
either way of writing one letter may be combined with either
way of writing another. Hence the whole number of ways in
which the class-symbol may be written, i.e. the number of
classes, is

$$ 2 \times 2 \times 2 \times 2 \cdots = 2^{n}. $$

The ultimate frequencies form one natural set in terms of which
the data are completely given, but any other set containing the
same number of algebraically independent frequencies, viz. $2^{n}$,
may be chosen instead.

15. The positive class-frequencies, including under this head the
total number of observations N, form one such set. They are algebraically
independent; no one positive class-frequency can be expressed
wholly in terms of the others. Their number is, moreover,
$2^{n}$, as may be readily seen from the fact that if the Greek letters
are struck out of the symbols for the ultimate classes, they become
the symbols for the positive classes, with the exception of αβγ
. . . for which N must be substituted. Otherwise the number
is made up as follows:

Order 0. (The whole number of observations) 1
Order 1. (The number of attributes noted) n
Order 2. (The number of combinations of n things 2 together) n(n-1)/(1·2)
Order 3. (The number of combinations of n things 3 together) n(n-1)(n-2)/(1·2·3)

and so on. But the series

$$ 1 + n + \frac{n(n-1)}{1 \cdot 2} + \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3} + \cdots $$

is the binomial expansion of $(1+1)^{n}$ or $2^{n}$, therefore the total
number of positive classes is $2^{n}$.
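
The count by orders may be verified numerically for small n. The short Python sketch below uses the standard-library function `math.comb` (Python 3.8 or later); the function name `positive_class_counts` is introduced here only for illustration.

```python
from math import comb


def positive_class_counts(n):
    """Number of positive classes of each order 0..n for n attributes."""
    return [comb(n, r) for r in range(n + 1)]


for n in range(1, 6):
    counts = positive_class_counts(n)
    # The counts are the terms of the binomial expansion of (1 + 1)^n.
    assert sum(counts) == 2 ** n
    print(n, counts, sum(counts))
```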

16. The set of positive class-frequencies is a most convenient 
one for both theoretical and practical purposes.

Compare, for instance, the two forms of statement, in terms of
the ultimate and the positive classes respectively, as given in
Example i., § 13. The latter gives directly the whole number of
observations and the totals of A's, B's, and C's. The former gives
none of these fundamentally important figures without the performance
of more or less lengthy additions. Further, the latter gives
the second-order frequencies (AB), (AC), and (BC), which are necessary
for discussing the relations subsisting between A, B, and C, but
are only indirectly given by the frequencies of the ultimate classes.

17. The expression of any class-frequency in terms of the
positive frequencies is most easily obtained by a process of step-by-step
substitution; thus:

$$
\begin{aligned}
(\alpha\beta) &= (\alpha) - (\alpha B) \\
              &= N - (A) - (B) + (AB)
\end{aligned}
\tag{3}
$$

$$
\begin{aligned}
(\alpha\beta\gamma) &= (\alpha\beta) - (\alpha\beta C) \\
  &= N - (A) - (B) + (AB) - (\alpha C) + (\alpha BC) \\
  &= N - (A) - (B) - (C) + (AB) + (AC) + (BC) - (ABC)
\end{aligned}
\tag{4}
$$

Arithmetical work, however, should be executed from first
principles, and not by quoting formulae like the above.

Example ii. Check the work of Example i., § 13, by finding the
frequencies of the ultimate classes from the frequencies of the
positive classes.

$$
\begin{aligned}
(AB\gamma) &= (AB) - (ABC) = 338 - 57 = 281 \\
(A\beta\gamma) &= (A\gamma) - (AB\gamma) = (A) - (AC) - (AB\gamma) \\
  &= 877 - 143 - 281 = 453 \\
(\alpha\beta\gamma) &= (\beta\gamma) - (A\beta\gamma) = N - (B) - (C) + (BC) - (A\beta\gamma) \\
  &= 10{,}000 - 1086 - 286 + 135 - 453 \\
  &= 10{,}135 - 1825 = 8310
\end{aligned}
$$

and so on.
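
For a check on hand arithmetic of this kind, the expansion can also be written as an alternating sum over the absent attributes, of which formula (4) is the case of three. The Python sketch below is one possible rendering and is not the step-by-step procedure recommended in the text. The positive frequencies are those of Example i.; lower-case letters stand for the Greek letters, and the function name is hypothetical.

```python
from itertools import combinations

# Positive class-frequencies of Example i.
positive = {"N": 10000, "A": 877, "B": 1086, "C": 286,
            "AB": 338, "AC": 143, "BC": 135, "ABC": 57}


def ultimate_frequency(symbol, positive):
    """Expand one class-frequency in terms of the positive frequencies.

    Lower-case letters stand for Greek letters. Every subset of the absent
    attributes is added to the present ones, with sign + for an even and
    sign - for an odd number of additions.
    """
    present = [ch for ch in symbol if ch.isupper()]
    absent = [ch.upper() for ch in symbol if ch.islower()]
    total = 0
    for r in range(len(absent) + 1):
        for extra in combinations(absent, r):
            key = "".join(sorted(present + list(extra))) or "N"
            total += (-1) ** r * positive[key]
    return total


for symbol in ["ABC", "ABc", "AbC", "Abc", "aBC", "aBc", "abC", "abc"]:
    print(f"({symbol}) = {ultimate_frequency(symbol, positive)}")
```

The eight values returned agree with the ultimate frequencies from which Example i. started.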

18. Examples of statistics of precisely the kind now under
consideration are afforded by the census returns, e.g., of 1891 or
1901, for England and Wales, of persons suffering from different
"infirmities," any individual who is deaf and dumb, blind or
mentally deranged (lunatic, imbecile, or idiot) being required to
be returned as such on the schedule. The classes chosen for
tabulation are, however, neither the positive nor the ultimate
classes, but the following (neglecting minor distinctions amongst
the mentally deranged and the returns of persons who are deaf
but not dumb): Dumb, blind, mentally deranged; dumb and
blind but not deranged; dumb and deranged but not blind;
blind and deranged but not dumb; blind, dumb, and deranged.
If, in the symbolic notation, deaf-mutism be denoted by A, blindness
by B, and mental derangement by C, the class-frequencies
thus given are (A), (B), (C), (ABγ), (AβC), (αBC), (ABC) (cf.
Census of England and Wales, 1891, vol. iii., tables 15 and 16,
p. lvii.; Census of 1901, Summary Tables, table xlix.). This set of
frequencies does not appear to possess any special advantages.

19. The symbols of our notation are, it should be remarked,
used in an inclusive sense, the symbol A, for example, signifying
an object or individual possessing the attribute A with or without
others. This seems to be the only natural use of the symbol,
but at least one notation has been constructed on an exclusive
basis (cf. ref. 5), the symbol A denoting that the object or individual
possesses the attribute A, but not B or C or D, or whatever
other attributes have been noted. An exclusive notation is
apt to be relatively cumbrous and also ambiguous, for the reader
cannot know what attributes a given symbol excludes until he
has seen the whole list of attributes of which note has been
taken, and this list he must bear in mind. The statement that
the symbol A is used exclusively cannot mean, obviously, that the
object referred to possesses only the attribute A and no others
whatever; it merely excludes the other attributes noted in the
particular investigation. Adjectives, as well as the symbols which
may represent them, are naturally used in an inclusive sense, and
care should therefore be taken, when classes are verbally described,
that the description is complete, and states what, if anything, is
excluded as well as what is included, in the same way as our
notation. The terminology of the English census has not, in
this respect, been quite clear. The "Blind" includes those who
are "Blind and Dumb," or "Blind, Dumb, and Lunatic," and so
forth. But the heading "Blind and Dumb," in the table relating
to "combined infirmities," is used in the sense "Blind and Dumb,
but not Lunatic or Imbecile," etc., and so on for the others. In
the first table the headings are inclusive, in the second exclusive.

REFERENCES.

(1) Jevons, W. Stanley, "On a General System of Numerically Definite
Reasoning," Memoirs of the Manchester Lit. and Phil. Soc., 1870.
Reprinted in Pure Logic and other Minor Works; Macmillan, 1890.
(The method used in these chapters is that of Jevons, with the notation
slightly modified to that employed in the next three memoirs cited.)

(2) Yule, G. U., "On the Association of Attributes in Statistics, etc.," Phil.
Trans. Roy. Soc., Series A, vol. cxciv., 1900, p. 257.

(3) Yule, G. U., "On the Theory of Consistence of Logical Class-Frequencies
and its Geometrical Representation," Phil. Trans. Roy. Soc., Series A,
vol. cxcvii., 1901, p. 91.

(4) Yule, G. U., "Notes on the Theory of Association of Attributes in
Statistics," Biometrika, vol. ii., 1903, p. 121. (The first three sections
of (4) are an abstract of (2) and (3). The remarks made as regards the
tabulation of class-frequencies at the end of (2) should be read in
connection with the remarks made at the beginning of (3) and in this
chapter; cf. footnote on p. 94 of (3).)

Material has been cited from, and reference made to the notation used in:

(5) Warner, F., and others, "Report on the Scientific Study of the Mental and
Physical Conditions of Childhood"; published by the Committee,
Parkes Museum, 1895.

(6) Warner, F., "Mental and Physical Conditions among Fifty Thousand
Children, etc.," Jour. Roy. Stat. Soc., vol. lix., 1896, p. 125.



denotes development defects ; B, n 





{ABC) 
(AOy) 


149 

738 

226 

1,198 








classes 








N       23,718        (AB)     …
(A)      1,918        (AC)     …
(B)      2,016        (BC)     …
(C)        770        (ABC)    …

Find the frequencies of the ultimate classes.

3. (Figures from Census, England and Wales, 1891, vol. iii.) Convert the
census statement as below into a statement in terms of (a) the positive, (b)
the ultimate class-frequencies. A = blindness, B = deaf-mutism, C = mental
derangement.



N      29,002,525        (ABγ)    …
(A)        23,467        (AβC)    …
(B)        14,192        (αBC)    …
(C)        97,383        (ABC)    26

4. (Cf. Mill's Logic, bk. iii., ch. xvii., and ref. (1).) Show that if A
occurs in a larger proportion of the cases where B is than where B is not,
then will B occur in a larger proportion of the cases where A is than where
A is not: i.e. given (AB)/(B) > (Aβ)/(β), show that (AB)/(A) > (αB)/(α).

5. (Cf. De Morgan, Formal Logic, p. 163, and ref. (1).) Most B's are A's,
most B's are C's: find the least number of A's that are C's, i.e. the lowest
possible value of (AC).

6. Given that

$$ (A) = (\alpha) = (B) = (\beta) = \tfrac{1}{2}N, $$

show that

$$ (AB) = (\alpha\beta), \qquad (A\beta) = (\alpha B). $$

7. (Cf. ref. (2), § 9, "Case of equality of contraries.") Given that

$$ (A) = (\alpha) = (B) = (\beta) = (C) = (\gamma) = \tfrac{1}{2}N, $$

and also that

$$ (ABC) = (\alpha\beta\gamma), $$

show that

$$ 2(ABC) = (AB) + (AC) + (BC) - \tfrac{1}{2}N. $$

8. Measurements are made on a thousand husbands and a thousand wives.
If the measurements of the husbands exceed the measurements of the wives in
800 cases for one measurement, in 700 cases for another, and in 660 cases for
both measurements, in how many cases will both measurements on the wife
exceed the measurements on the husband?






CHAPTER II. 
CONSISTENCE. 

1-3. The field of observation or universe and its specification by symbols.
4. Derivation of complex from simple relations by specifying the
universe. 5-6. Consistence. 7-10. Conditions of consistence for one
and for two attributes. 11-14. Conditions of consistence for three
attributes.

1. Any statistical inquiry is necessarily confined to a certain
time, space, or material. An investigation on the prevalence of
insanity, for instance, may be limited to England, to England in
1901, to English males in 1901, or even to English males over 60
years of age in 1901, and so on.

For actual work on any given subject, no term is required to
denote the material to which the work is so confined: the limits
are specified, and that is sufficient. But for theoretical purposes
some term is almost essential to avoid circumlocution. The expression
the universe of discourse, or simply the universe, used
in this sense by writers on logic, may be adopted as familiar and
convenient.

2. The universe, like any class, may be considered as specified
by an enumeration of the attributes common to all its members,
e.g. to take the illustration of § 1, those implied by the predicates
English, male, over 60 years of age, living in 1901. It is not, in
general, necessary to introduce a special letter into the class-symbols
to denote the attributes common to all members of the
universe. We know that such attributes must exist, and the
common symbol can be understood.

In strictness, however, the symbol ought to be written: if, say,
U denote the combination of attributes English, male, over 60,
living in 1901; A insanity; B blindness; we should strictly use
the symbols

(U)   = Number of English males over 60 living in 1901,
(UA)  = Number of insane English males over 60 living in 1901,
(UB)  = Number of blind English males over 60 living in 1901,
(UAB) = Number of blind and insane English males over 60 living in 1901,

instead of the simpler symbols N, (A), (B), (AB). Similarly, the
general relations (2), § 13, Chap. I., using U to denote the common
attributes of all the members of the universe and (U) consequently
the total number of observations N, should in strictness be written
in the form

$$
\begin{aligned}
(U)   &= (UA) + (U\alpha) = (UB) + (U\beta) = \text{etc.} \\
      &= (UAB) + (UA\beta) + (U\alpha B) + (U\alpha\beta) = \text{etc.} \\
(UA)  &= (UAB) + (UA\beta) = (UAC) + (UA\gamma) = \text{etc.} \\
(UAB) &= (UABC) + (UAB\gamma) = \text{etc.}
\end{aligned}
$$

3. Clearly, however, we might have used any other symbol
instead of U to denote the attributes common to all the members
of the universe, e.g. A or B or AB or ABC, writing in the latter
case

$$ (ABC) = (ABCD) + (ABC\delta) $$

and so on. Hence any attribute or combination of attributes
common to all the class-symbols in an equation may be regarded as
specifying the universe within which the equation holds good.
Thus the equation just written may be read in words: "The
number of objects or individuals in the universe ABC is equal to
the number of D's together with the number of not-D's within
the same universe." The equation

$$ (AC) = (ABC) + (A\beta C) $$

may be read: "The number of A's is equal to the number of A's
that are B together with the number of A's that are not-B
within the universe C."

4. The more complex may be derived from the simpler relations
between class-frequencies very readily by the process of specifying
the universe. Thus starting from the simple equation

$$ (\alpha) = N - (A), $$

we have, by specifying the universe as β,

$$ (\alpha\beta) = (\beta) - (A\beta) = N - (A) - (B) + (AB). $$

Specifying the universe, again, as γ, we have

$$
\begin{aligned}
(\alpha\beta\gamma) &= (\gamma) - (A\gamma) - (B\gamma) + (AB\gamma) \\
  &= N - (A) - (B) - (C) + (AB) + (AC) + (BC) - (ABC).
\end{aligned}
$$

5. Any class-frequencies which have been or might have been
observed within one and the same universe may be said to be
consistent with one another. They conform with one another,
and do not in any way conflict.

The conditions of consistence are some of them simple, but
others are by no means of an intuitive character. Suppose, for
instance, the data are given:

N       1000        (AB)     42
(A)      525        (AC)     …
(B)      312        (BC)     …
(C)      470        (ABC)    …

There is nothing obviously wrong with the figures. Yet they
are certainly inconsistent. They might have been observed at
different times, in different places or on different material, but
they cannot have been observed in one and the same universe.
They imply, in fact, a negative value for (αβγ):

$$ (\alpha\beta\gamma) = N - (A) - (B) - (C) + (AB) + (AC) + (BC) - (ABC) = -57. $$

Clearly no class-frequency can be negative. If the figures,
consequently, are alleged to be the result of an actual inquiry in
a definite universe, there must have been some miscount or
misprint.

6. Generally, then, we may say that any given class-frequencies 
are inconsistent if they imply negative values for any of the
unstated frequencies. Otherwise they are consistent. To test the
consistence of any set of $2^{n}$ algebraically independent frequencies,
for the case of n attributes, we should accordingly calculate
the values of all the unstated frequencies, and so verify the fact
that they are positive. This procedure may, however, be limited
by a simple consideration. If the ultimate class-frequencies are
positive, all others must be so, being derived from the ultimate
frequencies by simple addition. Hence we need only calculate
the values of the ultimate class-frequencies in terms of those
given, and verify the fact that they exceed zero.
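
The test just described may be sketched in Python as a step-by-step substitution, one attribute at a time, followed by a search for negative values. In the data below N, (A), (B), (C) and (AB) are the figures of § 5; the values 147, 86 and 25 used for (AC), (BC) and (ABC) are assumptions of this sketch, chosen so that the expansion gives the value of -57 quoted in § 5. The function name and the use of frozensets as class-symbols are likewise assumptions.

```python
def ultimate_from_positive(positive, letters):
    """Derive every ultimate frequency by step-by-step substitution.

    `positive` maps a frozenset of attributes to the positive class-frequency
    (the empty frozenset gives N). In the result, each key lists the
    attributes present; all the other attributes are absent.
    """
    f = dict(positive)
    for letter in letters:
        for present in list(f):
            if letter not in present:
                # (class without the attribute) = (class) - (class with it)
                f[present] = f[present] - f[present | {letter}]
    return f


positive = {
    frozenset(): 1000,
    frozenset("A"): 525, frozenset("B"): 312, frozenset("C"): 470,
    frozenset("AB"): 42,
    # ASSUMED values, for illustration only:
    frozenset("AC"): 147, frozenset("BC"): 86, frozenset("ABC"): 25,
}

ultimate = ultimate_from_positive(positive, "ABC")
negative = {"".join(sorted(k)): v for k, v in ultimate.items() if v < 0}
consistent = not negative
```

With these figures the only negative ultimate frequency is the one in which no attribute is present, so the data are declared inconsistent although every given figure is itself positive.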

7. As we saw in the last chapter, there are two sets of 2^n
algebraically independent frequencies of practical importance, viz.
(1) the ultimate, (2) the positive class-frequencies.

It follows from what we have just said that there is only one
condition of consistence for the ultimate frequencies, viz. that
they must all exceed zero. Apart from this, any one frequency of
the set may vary anywhere between 0 and ∞ without becoming
inconsistent with the others.

For the positive class-frequencies, the conditions may be
expressed symbolically by expanding the ultimate in terms of
the positive frequencies, and writing each such expansion not
less than zero. We will consider the cases of one, two, and
three attributes in turn.

8. If only one attribute be noted, say A, the positive frequencies
are N and (A). The ultimate frequencies are (A) and (α), where
(α) = N − (A). The conditions of consistence are therefore simply

$$\text{(a)}\ \ (A) \geq 0 \qquad\qquad \text{(b)}\ \ (A) \leq N \qquad\qquad (1)$$

These conditions are obvious: the number of A's cannot be less
than zero, nor exceed the whole number of observations.

9. If two attributes be noted there are four ultimate frequencies
(AB), (Aβ), (αB), (αβ). The following conditions are given by
expanding each in terms of the frequencies of positive classes:

$$
\begin{aligned}
\text{(a)}\quad (AB) &\geq 0 &&\text{or } (AB) \text{ would be negative}\\
\text{(b)}\quad (AB) &\geq (A) + (B) - N &&\text{or } (\alpha\beta) \text{ would be negative}\\
\text{(c)}\quad (AB) &\leq (A) &&\text{or } (A\beta) \text{ would be negative}\\
\text{(d)}\quad (AB) &\leq (B) &&\text{or } (\alpha B) \text{ would be negative}
\end{aligned}
\qquad (2)
$$

(a), (c), and (d) are obvious; (b) is perhaps a little less obvious,
and is occasionally forgotten. It is, however, of precisely the
same type as the other three. None of these conditions are
really of a new form, but may be derived at once from (1) (a) and
(1) (b) by specifying the universe as B or as β respectively. The
conditions (2) are therefore really covered by (1).

10. But a further point arises as regards such a system of
limits as is given by (2). The conditions (a) and (b) give lower or
minor limits to the value of (AB); (c) and (d) give upper or
major limits. If either major limit be less than either minor limit
the conditions are impossible, and it is necessary to see whether
(A) and (B) can take such values that this may be the case.

Expressing the condition that the major limits must be not less
than the minor, we have:

$$(A) \geq 0 \qquad (B) \geq 0 \qquad (A) \leq N \qquad (B) \leq N$$

These are simply the conditions of the form (1). If, therefore,
(A) and (B) fulfil the conditions (1), the conditions (2) must be
possible. The conditions (1) and (2) therefore give all the
conditions of consistence for the case of two attributes, conditions of
an extremely simple and obvious kind.
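The limits set by conditions (2) can be written out directly. The sketch below is illustrative only: the function name `limits_AB` and the figures passed to it are assumptions, not taken from the text.

```python
def limits_AB(N, A, B):
    """Minor and major limits for (AB) implied by conditions (2)."""
    lower = max(0, A + B - N)  # (2)(a) and (2)(b)
    upper = min(A, B)          # (2)(c) and (2)(d)
    return lower, upper


# Hypothetical figures: 700 A's and 600 B's in 1000 observations.
print(limits_AB(1000, 700, 600))  # (300, 600)
# When (A) + (B) does not exceed N, the only minor limit is zero.
print(limits_AB(100, 20, 30))     # (0, 20)
```

So long as (A) and (B) each lie between 0 and N, the lower limit returned can never exceed the upper, which is the point made in § 10.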

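11. Now consider the case of three attributes. There are
eight ultimate frequencies. Expanding the ultimate in terms of
the positive frequencies, and expressing the condition that each
expansion is not less than zero, we have the following conditions
(each is followed by the frequency that will be negative if it fails):

$$
\begin{aligned}
\text{(a)}\quad (ABC) &\geq 0 && (ABC)\\
\text{(b)}\quad (ABC) &\geq (AB) + (AC) - (A) && (A\beta\gamma)\\
\text{(c)}\quad (ABC) &\geq (AB) + (BC) - (B) && (\alpha B\gamma)\\
\text{(d)}\quad (ABC) &\geq (AC) + (BC) - (C) && (\alpha\beta C)\\
\text{(e)}\quad (ABC) &\leq (AB) && (AB\gamma)\\
\text{(f)}\quad (ABC) &\leq (AC) && (A\beta C)\\
\text{(g)}\quad (ABC) &\leq (BC) && (\alpha BC)\\
\text{(h)}\quad (ABC) &\leq (AB) + (AC) + (BC) - (A) - (B) - (C) + N && (\alpha\beta\gamma)
\end{aligned}
\qquad (3)
$$

These, again, are not conditions of a new form. We leave it
as an exercise for the student to show that they may be derived
from (1) (a) and (1) (b) by specifying the universe in turn as
BC, Bγ, βC, and βγ. The two conditions holding in four universes
give the eight inequalities above.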

12. As in the last case, however, these conditions will be
impossible to fulfil if any one of the major limits (e)-(h) be less than
any one of the minor limits (a)-(d). The values on the right
must be such as to make no major limit less than a minor.

There are four major and four minor limits, or sixteen
comparisons in all to be made. But twelve of these, the student will
find, only lead back to conditions of the form (2) for (AB), (AC),
and (BC) respectively. The four comparisons of expansions due
to contrary frequencies ((a) and (h), (b) and (g), (c) and (f), (d)
and (e)) alone lead to new conditions, viz.:

$$
\begin{aligned}
\text{(a)}\quad (AB) + (AC) + (BC) &\geq (A) + (B) + (C) - N\\
\text{(b)}\quad (AB) + (AC) - (BC) &\leq (A)\\
\text{(c)}\quad (AB) - (AC) + (BC) &\leq (B)\\
\text{(d)}\quad -(AB) + (AC) + (BC) &\leq (C)
\end{aligned}
\qquad (4)
$$

13. These are conditions of a wholly new type, not derivable
in any way from those given under (1) and (2). They are
conditions for the consistence of the second-order frequencies with
each other, whilst the inequalities of the form (2) are only conditions
for the consistence of the second-order frequencies with those of
lower orders. Given any two of the second-order frequencies, e.g.
(AB) and (AC), the conditions (4) give limits for the third, viz.
(BC). They thus replace, for statistical purposes, the ordinary
rules of syllogistic inference. From data of the syllogistic form,
they would, of course, lead to the same conclusion, though in a
somewhat cumbrous fashion; one or two cases are suggested as
exercises for the student (Questions 6 and 7). The following
will serve as illustrations of the statistical uses of the
conditions:

Example i. Given that (A) = (B) = (C) = ½N and 80 per cent.
of the A's are B, 75 per cent. of A's are C, find the limits to the
percentage of B's that are C. The data are:

$$\frac{2(AB)}{N} = 0.8 \qquad \frac{2(AC)}{N} = 0.75$$

and the conditions give:

$$
\begin{aligned}
\text{(a)}\quad \frac{2(BC)}{N} &\geq 1 - 0.8 - 0.75\\
\text{(b)}\quad \frac{2(BC)}{N} &\geq 0.8 + 0.75 - 1\\
\text{(c)}\quad \frac{2(BC)}{N} &\leq 1 - 0.8 + 0.75\\
\text{(d)}\quad \frac{2(BC)}{N} &\leq 1 + 0.8 - 0.75
\end{aligned}
$$

(a) gives a negative limit and (d) a limit greater than unity;
hence they may be disregarded. From (b) and (c) we have:

$$0.55 \leq \frac{2(BC)}{N} \leq 0.95$$

that is to say, not less than 55 per cent. nor more than 95 per
cent. of the B's can be C.

Example ii. If a report give the following frequencies as
actually observed, show that there must be a misprint or mistake
of some sort, and that possibly the misprint consists in the
dropping of a 1 before the 85 given as the frequency (BC).

N   = 1000
(A) = 510        (AB) = 189
(B) = 490        (AC) = 140
(C) = 427        (BC) = 85

From (4) (a) we have:

$$(BC) \geq 510 + 490 + 427 - 1000 - 189 - 140 = 98.$$

But 85 < 98, therefore it cannot be the correct value of (BC).
If we read 185 for 85 all the conditions are fulfilled.
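The limits used in Examples i. and ii. can be checked with a short Python sketch. The function name `limits_BC` is an assumption made here for illustration; it combines the conditions (2) for (BC) with the conditions (4), and the figures are those of the two examples (with N taken as 200 in Example i., so that (A) = (B) = (C) = 100).

```python
def limits_BC(N, A, B, C, AB, AC):
    """Minor and major limits for (BC) from conditions (2) and (4)."""
    lower = max(
        0,                        # (2)(a) applied to (BC)
        B + C - N,                # (2)(b) applied to (BC)
        A + B + C - N - AB - AC,  # (4)(a)
        AB + AC - A,              # (4)(b)
    )
    upper = min(
        B,                        # (2)(d) applied to (BC)
        C,                        # (2)(c) applied to (BC)
        B - AB + AC,              # (4)(c)
        C + AB - AC,              # (4)(d)
    )
    return lower, upper


# Example i.: limits are 55 and 95 out of 100 B's.
print(limits_BC(200, 100, 100, 100, 80, 75))        # (55, 95)

# Example ii.: the stated (BC) = 85 falls below the minor limit, 185 does not.
low, high = limits_BC(1000, 510, 490, 427, 189, 140)
print(low, high)                                    # 98 427
print(low <= 85 <= high, low <= 185 <= high)        # False True
```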



Example iii. In a certain set of 1000 observations (A) = 45,
(B) = 23, (C) = 14. Show that whatever the percentages of B's
that are A and of C's that are A, it cannot be inferred that any B's
are C.

The conditions (a) and (b) give the lower limit of (BC), which
is required. We find:

$$
\begin{aligned}
\text{(a)}\quad \frac{(BC)}{N} &\geq 0.045 + 0.023 + 0.014 - 1 - \frac{(AB)}{N} - \frac{(AC)}{N}\\
\text{(b)}\quad \frac{(BC)}{N} &\geq \frac{(AB)}{N} + \frac{(AC)}{N} - 0.045
\end{aligned}
$$

The first limit is clearly negative. The second must also be
negative, since (AB)/N cannot exceed 0.023 nor (AC)/N 0.014.
Hence we cannot conclude that there is any limit to (BC) greater
than 0. This result is indeed immediately obvious when we
consider that, even if all the B's were A, and of the remaining
22 A's 14 were C's, there would still be 8 A's that were neither
B nor C.
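The conclusion of Example iii. can also be confirmed by brute force. The sketch below (function names are assumptions made for illustration) tries every value of (AB) from 0 to 23 and of (AC) from 0 to 14 and collects the resulting minor limit of (BC); not every such pair need itself be consistent, so the search covers more cases than strictly necessary.

```python
def lower_limit_BC(N, A, B, C, AB, AC):
    """Greatest of the minor limits on (BC) from conditions (2) and (4)."""
    return max(0, B + C - N, A + B + C - N - AB - AC, AB + AC - A)


def all_lower_limits(N, A, B, C):
    """Set of minor limits met as (AB) and (AC) range over all values up to (B), (C)."""
    found = set()
    for AB in range(min(A, B) + 1):
        for AC in range(min(A, C) + 1):
            found.add(lower_limit_BC(N, A, B, C, AB, AC))
    return found


print(all_lower_limits(1000, 45, 23, 14))  # {0}
```

Whatever the values of (AB) and (AC), the minor limit never rises above zero, in agreement with the argument of the example.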

14. The student should note the result of the last example, as it
illustrates the sort of result at which one may often arrive by
applying the conditions (4) to practical statistics. For given
values of N, (A), (B), (C), (AB), and (AC), it will often happen
that any value of (BC) not less than zero (or, more generally, not
less than either of the lower limits (2) (a) and (2) (b)) will satisfy
the conditions (4), and hence no true inference of a lower limit is
possible. The argument of the type "So many A's are B and
so many B's are C that we must expect some A's to be C" must
be used with caution.



REFERENCES. 

(1) Morgan, A. de, Formal Logic, 1847 (chapter viii., "On the Numerically
Definite Syllogism").

(2) Boole, G., Laws of Thought, 1854 (chapter xix., "Of Statistical
Conditions").

The above are the classical works with respect to the general theory
of numerical consistence. The student will find both difficult to follow
on account of their special notation, and, in the case of Boole's work,
the special method employed.

(3) Yule, G. U., "On the Theory of Consistence of Logical Class-frequencies
and its Geometrical Representation," Phil. Trans., A, vol. cxcvii.
(1901), p. 91. (Deals at length with the theory of consistence for
any number of attributes, using the notation of the present chapters.)



EXERCISES.



1. (For this and similar estimates cf. "Report by Miss Collet on the
Statistics of Employment of Women and Girls" [C.-7664] 1891.) If, in the
urban district of Bury, 817 per thousand of the women between 20 and 25
years of age were returned as "occupied" at the census of 1891, and 263 per
thousand as married or widowed, what is the lowest proportion per thousand
of the married or widowed that must have been occupied?

2. If, in a series of houses actually invaded by small-pox, 70 per cent. of the
inhabitants are attacked and 85 per cent. have been vaccinated, what is the
lowest percentage of the vaccinated that must have been attacked?

3. Given that 50 per cent. of the inmates of a workhouse are men, 60 per
cent. are "aged" (over 60), 80 per cent. non-able-bodied, 35 per cent. aged
men, … per cent. non-able-bodied men, and 42 per cent. non-able-bodied and
aged, find the greatest and least possible proportions of non-able-bodied aged
men.

4. (Material from ref. 5 of Chap. I.) The following are the proportions
per 10,000 of boys observed, with certain classes of defects amongst a number
of school-children. A = development defects, B = nerve signs, D = mental
dulness.

N   = 10,000        (D)  = 789
(A) =    877        (AB) = 338
(B) =  1,086        (BD) = 455

Show that some dull boys do not exhibit development defects, and state how
many at least do not do so.

5. The following are the corresponding figures for girls:

N   = 10,000        (D)  = 689
(A) =    582        (AB) = 248
(B) =    860        (BD) = 363

Show that some defectively developed girls are not dull, and state how many
at least must be so.

6. Take the syllogism "All A's are B, all B's are C, therefore all A's
are C," express … and deduce the …

7. Do the same … no A's are C."

8. Given that (A) = (B) = (C) = ½N, and that (AB)/N = (AC)/N = p, find
what must be the greatest or least values of p in order that we may infer
that (BC)/N exceeds any given value, say q.

9. Show that if

$$\frac{(A)}{N} = x, \qquad \frac{(B)}{N} = 2x, \qquad \frac{(C)}{N} = 3x,$$

and

$$\frac{(AB)}{N} = \frac{(AC)}{N} = \frac{(BC)}{N} = y,$$

the value of neither x nor y can exceed ¼.



CHAPTER III. 

ASSOCIATION.

1-4. The criterion of independence. 5-10. The conception of association and
testing for the same by the comparison of percentages. 11-12.
Numerical equality of the differences between the four second-order
frequencies and their independence values. 13. The coefficient of
association. 14. Necessity for an investigation into the causation of
an attribute A being extended to include non-A's.

1. If there is no sort of relationship, of any kind, between two
attributes A and B, we expect to find the same proportion of A's
amongst the B's as amongst the non-B's. We may anticipate,
for instance, the same proportion of abnormally wet seasons in
leap years as in ordinary years, the same proportion of male to
total births when the moon is waxing as when it is waning, the
same proportion of heads whether a coin be tossed with the right
hand or the left.

Two such unrelated attributes may be termed independent, and
we have accordingly as the criterion of independence for A and B:

$$\frac{(AB)}{(B)} = \frac{(A\beta)}{(\beta)} \qquad (1)$$

If this relation hold good, the corresponding relations

$$\frac{(\alpha B)}{(B)} = \frac{(\alpha\beta)}{(\beta)} \qquad \frac{(AB)}{(A)} = \frac{(\alpha B)}{(\alpha)} \qquad \frac{(A\beta)}{(A)} = \frac{(\alpha\beta)}{(\alpha)}$$

must also hold. For it follows at once from (1) that

$$\frac{(B) - (AB)}{(B)} = \frac{(\beta) - (A\beta)}{(\beta)}, \quad\text{that is,}\quad \frac{(\alpha B)}{(B)} = \frac{(\alpha\beta)}{(\beta)},$$

and the other two identities may be similarly deduced.

2. The criterion may, however, be put into a somewhat
different and theoretically more convenient form. The equation
(1) expresses (AB) in terms of (B), (β), and a second-order
frequency (Aβ); eliminating this second-order frequency we have:

$$\frac{(AB)}{(B)} = \frac{(AB) + (A\beta)}{(B) + (\beta)} = \frac{(A)}{N},$$

i.e. in words, "the proportion of A's amongst the B's is the same
as in the universe at large." The student should learn to
recognise this equation at sight in any of the forms:

$$
\begin{aligned}
\text{(a)}\quad \frac{(AB)}{(B)} &= \frac{(A)}{N} &\qquad \text{(b)}\quad \frac{(AB)}{(A)} &= \frac{(B)}{N}\\
\text{(c)}\quad (AB) &= \frac{(A)(B)}{N} &\qquad \text{(d)}\quad \frac{(AB)}{N} &= \frac{(A)}{N}\cdot\frac{(B)}{N}
\end{aligned}
\qquad (2)
$$

The equation (d) gives the important fundamental rule: If the
attributes A and B are independent, the proportion of AB's in the
universe is equal to the proportion of A's multiplied by the
proportion of B's.

The advantage of the forms (2) over the form (1) is that they
give expressions for the second-order frequency in terms of the
frequencies of the first order and the whole number of
observations alone; the form (1) does not.

Example i. If there are 144 A's and 384 B's in 1024
observations, how many AB's will there be, A and B being independent?

$$\frac{144 \times 384}{1024} = 54.$$

There will therefore be 54 AB's.

Example ii. If the A's are 60 per cent., the B's 35 per cent., of
the whole number of observations, what must be the percentage
of AB's in order that we may conclude that A and B are
independent?

$$\frac{60 \times 35}{100} = 21,$$

and therefore there must be 21 per cent. (more or less closely, cf.
§§ 7, 8 below) of AB's in the universe to justify the conclusion
that A and B are independent.

3. It follows from § 1 that if the relation (2) holds for any one
of the four second-order frequencies, e.g. (AB), similar relations
must hold for the remaining three. Thus we have directly
from (1):

$$\frac{(A\beta)}{(\beta)} = \frac{(AB) + (A\beta)}{(B) + (\beta)} = \frac{(A)}{N},$$

giving

$$(A\beta) = \frac{(A)(\beta)}{N},$$

and

$$\frac{(\alpha B)}{(B)} = \frac{(\alpha\beta)}{(\beta)} = \frac{(\alpha B) + (\alpha\beta)}{(B) + (\beta)} = \frac{(\alpha)}{N},$$

giving

$$(\alpha B) = \frac{(\alpha)(B)}{N}, \qquad (\alpha\beta) = \frac{(\alpha)(\beta)}{N}.$$

Example iii. In Example i. above, what would be the number
of αβ's, A and B being independent?

(α) = 1024 − 144 = 880
(β) = 1024 − 384 = 640

$$\therefore\ (\alpha\beta) = \frac{880 \times 640}{1024} = 550.$$

The theorem is an important one, and the result may be
deduced more directly from first principles, replacing (AB) by
its value (A)(B)/N in the expansions:

(αB) = (B) − (AB).
(Aβ) = (A) − (AB).
(αβ) = N − (A) − (B) + (AB).

This is left as an exercise for the student.
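The independence values of all four second-order frequencies follow from N, (A) and (B) alone, as the short Python sketch below illustrates with the figures of Examples i. and iii. The function name is an assumption made here; exact fractions are used so that no rounding enters.

```python
from fractions import Fraction


def independence_values(N, A, B):
    """Second-order frequencies to be expected if A and B are independent."""
    alpha, beta = N - A, N - B
    return {
        "AB": Fraction(A * B, N),
        "Aβ": Fraction(A * beta, N),
        "αB": Fraction(alpha * B, N),
        "αβ": Fraction(alpha * beta, N),
    }


values = independence_values(1024, 144, 384)
print(values["AB"])          # 54
print(values["αβ"])          # 550
print(sum(values.values()))  # 1024
```

The four values necessarily add up to N, which gives a convenient check on the arithmetic.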

4. Finally, the criterion of independence may be expressed in
yet a third form, viz. in terms of the second-order frequencies
alone. If A and B are independent, it follows at once from
equation (2) and the work of the preceding section that

$$(AB)(\alpha\beta) = \frac{(A)(B)(\alpha)(\beta)}{N^2}.$$

And evidently (αB)(Aβ) is equal to the same fraction.

Therefore

$$
\begin{aligned}
\text{(a)}\quad (AB)(\alpha\beta) &= (\alpha B)(A\beta)\\
\text{(b)}\quad \frac{(AB)}{(\alpha B)} &= \frac{(A\beta)}{(\alpha\beta)}\\
\text{(c)}\quad \frac{(AB)}{(A\beta)} &= \frac{(\alpha B)}{(\alpha\beta)}
\end{aligned}
\qquad (3)
$$

The equation (b) may be read "The ratio of A's to α's amongst
the B's is equal to the ratio of A's to α's amongst the β's," and
(c) similarly.

This form of criterion is a convenient one if all the four
second-order frequencies are given, enabling one to recognise
almost at a glance whether or not the two attributes are
independent.

Example iv. If the second-order frequencies have the following
values, are A and B independent or not?

(AB) = 110    (αB) = 90    (Aβ) = 290    (αβ) = 510.

Clearly (AB)(αβ) > (αB)(Aβ),

so A and B are not independent.
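The cross-product form of the criterion lends itself to a one-line check. In the sketch below the function name is an assumption, and the third frequency of Example iv. is read as 290, which is itself an assumption (it makes the four frequencies total 1000).

```python
def cross_product_difference(AB, A_beta, alpha_B, alpha_beta):
    """(AB)(αβ) - (αB)(Aβ); zero when A and B are independent."""
    return AB * alpha_beta - alpha_B * A_beta


# Example iv., with (Aβ) read as 290 (an assumption about the printed figure).
diff = cross_product_difference(110, 290, 90, 510)
print(diff)       # 30000
print(diff == 0)  # False, so A and B are not independent
```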

5. Suppose now that A and B are not independent, but related
in some way or other, however complicated.

Then if

$$(AB) > \frac{(A)(B)}{N},$$

A and B are said to be positively associated, or sometimes simply
associated. If, on the other hand,

$$(AB) < \frac{(A)(B)}{N},$$

A and B are said to be negatively associated or, more briefly,
disassociated.

The student should notice that these words are not used
exactly in their ordinary senses, but in a technical sense. When
A and B are said to be associated, it is not meant merely that
some A's are B's, but that the number of A's which are B's exceeds
the number to be expected if A and B are independent. Similarly,
when A and B are said to be negatively associated or disassociated,
it is not meant that no A's are B's, but that the number of A's
which are B's falls short of the number to be expected if A and B
are independent. "Association" cannot be inferred from the
mere fact that some A's are B's, however great that proportion;
this principle is fundamental, and should be always borne
in mind.

6. The greatest possible value of (AB) for given values of
N, (A), and (B) is either (A) or (B) (whichever is the less). When
(AB) attains either of these values, A and B may be said to be
completely or perfectly associated. The lowest possible value of
(AB), on the other hand, is either zero or (A) + (B) − N
(whichever is the greater). When (AB) falls to either of these values,
A and B may be said to be completely disassociated. Complete
association is generally understood to correspond to one or other
of the cases, "All A's are B" or "All B's are A," or it might be
more narrowly defined as corresponding only to the case when
both these statements were true. Complete disassociation may
be similarly taken as corresponding to one or other of the cases,
"No A's are B," or "no α's are β," or more narrowly to the
case when both these statements are true. The greater the
divergence of (AB) from the value (A)(B)/N towards the
limiting value in either direction, the greater, we may say, is the
intensity of association or of disassociation, so that we may speak
of attributes being more or less, highly or slightly associated. This
conception of degrees of association, degrees which may in fact be
measured by certain formulae (cf. § 13), is important.

7. When the association is very slight, i.e. where (AB) only
differs from (A)(B)/N by a few units or by a small proportion, it
may be that such association is not really significant of any
definite relationship. To give an illustration, suppose that a coin
is tossed a number of times, and the tosses noted in pairs; then
100 pairs may give such results as the following (taken from an
actual record):

First toss heads and second heads . . . 26
First toss heads and second tails . . . 18
First toss tails and second heads . . . 27
First toss tails and second tails . . . 29

If we use A to denote "heads" in the first toss, B "heads" in
the second, we have from the above (A) = 44, (B) = 53. Hence
(A)(B)/N = 44 × 53/100 = 23.32, while actually (AB) is 26. Hence
there is a positive association, in the given record, between
the result of the first throw and the result of the second. But it
is fairly certain, from the nature of the case, that such association
cannot indicate any real connection between the results of the
two throws; it must therefore be due merely to such a complex
system of causes, impossible to analyse, as leads, for example, to
differences between small samples drawn from the same material.
The conclusion is confirmed by the fact that, of a number of such
records, some give a positive association (like the above), but
some a negative association.
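The arithmetic of the coin-tossing record may be retraced in a few lines of Python; the variable names are chosen here for illustration only.

```python
# Pairs of tosses from the record quoted above.
heads_heads, heads_tails, tails_heads, tails_tails = 26, 18, 27, 29

N = heads_heads + heads_tails + tails_heads + tails_tails
A = heads_heads + heads_tails  # heads on the first toss
B = heads_heads + tails_heads  # heads on the second toss

expected_AB = A * B / N        # independence value (A)(B)/N
excess = heads_heads - expected_AB

print(N, A, B)                 # 100 44 53
print(round(expected_AB, 2))   # 23.32
print(round(excess, 2))        # 2.68
```

The excess of fewer than three pairs in a hundred is the "slight" association that the text attributes to the fluctuations of sampling.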

8. An event due, like the above occurrence of positive
association, to an extremely complex system of causes of the general nature
of which we are aware, but of the detailed operation of which we
are ignorant, is sometimes said to be due to chance, or better to
the chances or fluctuations of sampling.

A little consideration will suggest that such associations due to
the fluctuations of sampling must be met with in all classes of
statistics. To quote, for instance, from § 1, the two illustrations
there given of independent attributes, we know that in any
actual record we would not be likely to find exactly the same
proportion of abnormally wet seasons in leap years as in ordinary
years, nor exactly the same proportion of male births when the
moon is waxing as when it is waning. But so long as the
divergence from independence is not well-marked we must regard such
attributes as practically independent, or dependence as at least
unproved.

The discussion of the question, how great the divergence must
be before we can consider it as "well-marked," must be postponed
to the chapters dealing with the theory of sampling. At present the
attention of the student can only be directed to the existence of
the difficulty, and to the serious risk of interpreting a "chance
association" as physically significant.

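9. The definition of § 5 suggests that we are to test the
existence or the intensity of association between two attributes
by a comparison of the actual value of (AB) with its
independence-value (as it may be termed) (A)(B)/N. The procedure is from the
theoretical standpoint perhaps the most natural, but it is usual,
in practice, to adopt a method of comparing proportions, e.g. the
proportion of A's amongst the B's with the proportion in the
universe at large. Such proportions are usually expressed in the
form of percentages or proportions per thousand.

A large number of such comparisons are available for the
purpose, as indicated by the inequalities (4) below, which all
hold good for the case of positive association between A and
B. The first two, (a) and (b), follow at once from the definition
of § 5; (c) and (d) follow from (a) and (b), on multiplying
across and expanding (A) and N in the first case, (B) and N
in the second. The deduction of the remainder is left to the
student.

$$
\begin{aligned}
\text{(a)}\ \ \frac{(AB)}{(B)} &> \frac{(A)}{N} &\qquad \text{(b)}\ \ \frac{(AB)}{(A)} &> \frac{(B)}{N}\\
\text{(c)}\ \ \frac{(AB)}{(B)} &> \frac{(A\beta)}{(\beta)} &\qquad \text{(d)}\ \ \frac{(AB)}{(A)} &> \frac{(\alpha B)}{(\alpha)}\\
\text{(e)}\ \ \frac{(\alpha B)}{(B)} &< \frac{(\alpha)}{N} &\qquad \text{(f)}\ \ \frac{(A\beta)}{(A)} &< \frac{(\beta)}{N}\\
\text{(g)}\ \ \frac{(\alpha B)}{(B)} &< \frac{(\alpha\beta)}{(\beta)} &\qquad \text{(h)}\ \ \frac{(A\beta)}{(A)} &< \frac{(\alpha\beta)}{(\alpha)}\\
\text{(j)}\ \ \frac{(\alpha\beta)}{(\beta)} &> \frac{(\alpha)}{N} &\qquad \text{(k)}\ \ \frac{(\alpha\beta)}{(\alpha)} &> \frac{(\beta)}{N}\\
\text{(l)}\ \ \frac{(A\beta)}{(\beta)} &< \frac{(A)}{N} &\qquad \text{(m)}\ \ \frac{(\alpha B)}{(\alpha)} &< \frac{(B)}{N}
\end{aligned}
\qquad (4)
$$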

The question arises then, which is the best comparison to adopt?

10. Two principles should decide this point: (1) of any two
comparisons, that is the better which brings out the more clearly
the degree of association; (2) of any two comparisons, that is
the better which illustrates the more important aspect of the
problem under discussion.

The second condition will generally exclude all the comparisons
(e)-(m), for the capital letters will naturally be used to denote
the important aspect of the character. We will generally be
concerned, for instance, with the proportion of A's amongst the
B's as compared with the β's (as in (c)), and not with the
proportion of the α's in those two universes (as in (g)); or with the
proportion of A's amongst the B's as compared with the whole
universe (a), and not with the proportion of α's amongst the
β's as compared with the whole universe (j). That is simply the
natural method of using the notation. We may confine our
attention accordingly to the comparisons (a)-(d). Of these
four, (c) or (d) is generally to be preferred to (a) or (b), for the
reason that either of the latter may give a misleading impression
as to the intensity of the association. We have in fact:

$$\frac{(A)}{N} = \frac{(AB)}{(B)}\cdot\frac{(B)}{N} + \frac{(A\beta)}{(\beta)}\cdot\frac{(\beta)}{N}.$$

Hence if (B)/N be large compared with (β)/N, (A)/N will
approach the value (AB)/(B) and the association will appear
to be very small, even though (AB)/(B) and (Aβ)/(β) differ
considerably. Suppose, for example, in some given case, for a
considerable number of observations:

(AB)/(B) = 0.70        (Aβ)/(β) = 0.40

this would mean a considerable positive association between
A and B. But if it were only stated that

(AB)/(B) = 0.70        (A)/N = 0.67

the association would appear to be small. Yet the two
statements are equivalent if (B)/N = 0.9, for then we have

(A)/N = 0.7 × 0.9 + 0.4 × 0.1 = 0.67

The meaning of (a) or (b), in fact, cannot be fully realised
unless the value of (B)/N (or (A)/N in the second case) is known,
and therefore (c) is to be preferred to (a), and (d) to (b). An
exception may, however, be made in cases where the proportion
of B's (or A's) in the universe is very small, so that (A)/N
approaches closely to (Aβ)/(β) or (B)/N to (αB)/(α) (cf. Example
vi. below).
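The identity used in this section, that the proportion of A's in the universe is a weighted mean of the proportions amongst the B's and amongst the β's, may be illustrated numerically. The function name and argument names below are assumptions made for the sketch; the figures are those of the text.

```python
def proportion_in_universe(p_A_in_B, p_A_in_beta, share_B):
    """(A)/N as the weighted mean of (AB)/(B) and (Aβ)/(β), with weights (B)/N and (β)/N."""
    return p_A_in_B * share_B + p_A_in_beta * (1 - share_B)


# Figures of the text: 0.70 amongst the B's, 0.40 amongst the β's, (B)/N = 0.9.
print(round(proportion_in_universe(0.70, 0.40, 0.9), 2))  # 0.67

# With the two universes equal in size the same contrast is far more visible.
print(round(proportion_in_universe(0.70, 0.40, 0.5), 2))  # 0.55
```

The second call uses a hypothetical value of (B)/N, added only to show how strongly comparison (a) depends on it.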

There still remains the choice between (a) and (b), or between
(c) and (d). This must be decided with reference to the second
principle, i.e. with regard to the more important aspect of the
problem under discussion, the exact question to be answered,
or the hypothesis to be tested, as illustrated by the examples
below. Where no definite question has to be answered or
hypothesis tested both pairs of proportions may be tabulated,
as in Example vi. again.

Example v. Association between sex and death. (Material
from 64th Annual Report Reg. General. [Cd. 1230] 1903.)

Males in England and Wales, 1901 . . . . 15,773,000
Females in England and Wales, 1901 . . . 16,848,000
Deaths of males . . . . . . . . . . . . .    285,618
Deaths of females . . . . . . . . . . . .    265,967

We may denote the number of males by (A), the number of
deaths by (B); then the natural comparison is between (AB)/(A)
and (αB)/(α), i.e. the proportion of males that died and the
proportion of females. We find:

(AB)/(A) = 285,618/15,773,000 = 0.0181
(αB)/(α) = 265,967/16,848,000 = 0.0158

Therefore (AB)/(A) > (αB)/(α), and there is positive association
between male sex and death. It is usual to express proportions
of deaths, births, marriages, etc., to the population as rates per
thousand; so that the above figures would be written:

Death-rate among males . . . . . . 18.1 per thousand.
Death-rate among females . . . . . 15.8 per thousand.

A comparison of the death-rate among males with the
death-rate for the whole population would be equally valid, but it
should be remembered that the latter depends on the sex-ratio
as well as on the causes that determine the death-rates amongst
males and females. The above figures give:

Death-rate among males . . . . . . . . 18.1 per thousand.
Death-rate for whole population . . . . 16.9 per thousand.

This brings out the difference between the death-rates of
males and of the whole population, but is not so clear an
indication of the difference between males and females, which is the
point to be investigated.

A comparison of the form (4) (c) is again valid for testing the
association, but the form is not desirable, illustrating very well
the remarks made above. Statisticians are concerned
with death-rates, and not with the sex-ratios of the living and
the dead. The student should learn, however, to recognise such
forms of statement as the following, as equivalent to the above:

Proportion of males amongst those
that died in the year . . . . . . . . . 518 per thousand.
Proportion of males amongst those
that did not die in the year . . . . . 483 per thousand.

Since (AB)/(B) > (Aβ)/(β), it follows, as before, that there is
positive association between A and B.
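The rates quoted in Example v. can be recomputed from the four census figures. The Python sketch below uses variable names chosen for illustration; it is a check on the arithmetic rather than part of the original example.

```python
males, females = 15_773_000, 16_848_000        # (A), (α)
male_deaths, female_deaths = 285_618, 265_967  # (AB), (αB)

deaths = male_deaths + female_deaths           # (B)
population = males + females                   # N

# Death-rates per thousand: the comparisons (AB)/(A), (αB)/(α) and (B)/N.
rate_males = 1000 * male_deaths / males
rate_females = 1000 * female_deaths / females
rate_all = 1000 * deaths / population

# Sex-ratios per thousand amongst the dead and the living: (AB)/(B) and (Aβ)/(β).
males_among_dead = 1000 * male_deaths / deaths
males_among_living = 1000 * (males - male_deaths) / (population - deaths)

print(round(rate_males, 1), round(rate_females, 1), round(rate_all, 1))  # 18.1 15.8 16.9
print(round(males_among_dead), round(males_among_living))                # 518 483
```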

Example vi. Deaf-mutism and Imbecility. (Material from
Census of 1901. Summary Tables. [Cd. 1523.])

Total population of England and Wales . . . . 32,528,000
Number of the imbecile (or feeble-minded) . .     48,882
Number of deaf-mutes . . . . . . . . . . . .     15,246
Number of imbecile deaf-mutes . . . . . . . .        451

Required, to find whether deaf-mutism is associated with
imbecility.

We may denote the number of the imbecile by (A), of
deaf-mutes by (B). One of the comparisons (a) or (b) may very well
be used in this case, seeing that (A)/N and (B)/N differ very
little from (Aβ)/(β) and (αB)/(α) respectively. The question
whether to give the preference to (a) or to (b) depends on the
nature of the investigation we wish to make. If it is desired to
exhibit the conditions among deaf-mutes (a) may be used:

Proportion of imbeciles amongst
deaf-mutes = (AB)/(B) . . . . . . . . 29.6 per thousand.
Proportion of imbeciles in the whole
population = (A)/N . . . . . . . . . .  1.5 per thousand.

If, on the other hand, it is desired to exhibit the conditions
amongst the imbecile, (b) will be preferable.

Proportion of deaf-mutes amongst
the imbecile = (AB)/(A) . . . . . . .  9.2 per thousand.
Proportion of deaf-mutes in the
whole population = (B)/N . . . . . . .  0.5 per thousand.

Either comparison exhibits very clearly the high degree of
association between the attributes. It may be pointed out, however,
that census data as to such infirmities are very untrustworthy.
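The two pairs of proportions in Example vi. follow directly from the four census totals, as this short Python sketch shows. The helper `per_thousand` is an assumption introduced for the illustration.

```python
def per_thousand(part, whole):
    """Proportion per thousand, rounded to one decimal place."""
    return round(1000 * part / whole, 1)


N = 32_528_000  # total population
A = 48_882      # the imbecile
B = 15_246      # deaf-mutes
AB = 451        # imbecile deaf-mutes

# Comparison (a): imbeciles amongst deaf-mutes against imbeciles in the universe.
print(per_thousand(AB, B), per_thousand(A, N))  # 29.6 1.5
# Comparison (b): deaf-mutes amongst the imbecile against deaf-mutes in the universe.
print(per_thousand(AB, A), per_thousand(B, N))  # 9.2 0.5
```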

Example vii. Eye-colour of father and son (material due
to Sir Francis Galton, as given by Professor Karl Pearson, Phil.
Trans., A, vol. cxcv. (1900), p. 138; the classes 1, 2, and 3 of the
memoir treated as light).

Fathers with light eyes and sons with light eyes (AB) . . . . . . . . 471
Fathers with light eyes and sons with not light eyes (Aβ) . . . . . . 151
Fathers with not light eyes and sons with light eyes (αB) . . . . . . 148
Fathers with not light eyes and sons with not light eyes (αβ) . . . . 230

Required to find whether the colour of the son's eyes is
associated with that of the father. In cases of this kind the
father is reckoned once for each son; e.g. a family in which the
father was light-eyed, two sons light-eyed and one not, would be
reckoned as giving two to the class AB and one to the class Aβ.

The best comparison here is:

Percentage of light-eyed amongst the sons
of light-eyed fathers . . . . . . . . . . . 76 per cent.
Percentage of light-eyed amongst the sons
of not-light-eyed fathers . . . . . . . . . 39 per cent.

But the following is equally valid:

Percentage of light-eyed amongst the
fathers of light-eyed sons . . . . . . . . 76 per cent.
Percentage of light-eyed amongst the
fathers of not-light-eyed sons . . . . . . 40 per cent.

The reason why the former comparison is preferable is, that we
usually wish to estimate the character of offspring from that of
the parents, and define heredity in terms of the resemblance of
offspring to parents. We do not, as a rule, want to make use of
the power of estimating the character of parents from that of their
offspring, nor do we define heredity in terms of the resemblance
of parents to offspring. Both modes of statement, however,
indicate equally clearly the tendency to resemblance between
father and son.

11. The values that the four second-order frequencies take in
the case of independence, viz.

$$\frac{(A)(B)}{N} \qquad \frac{(\alpha)(B)}{N} \qquad \frac{(A)(\beta)}{N} \qquad \frac{(\alpha)(\beta)}{N}$$

are of such great theoretical importance, and of so much use as
reference-values for comparing with the actual values of the
frequencies (AB), (αB), (Aβ) and (αβ), that it is often desirable to
employ single symbols to denote them. We shall use the symbols

$$(AB)_0 \qquad (\alpha B)_0 \qquad (A\beta)_0 \qquad (\alpha\beta)_0.$$

If δ denote the excess of (AB) over (AB)₀, then we have

$$
\begin{aligned}
(\alpha B) &= (B) - (AB) = (B) - (AB)_0 - \delta\\
&= \frac{(B)\{N - (A)\}}{N} - \delta\\
&= (\alpha B)_0 - \delta.
\end{aligned}
$$

$$\therefore\ (AB) - (AB)_0 = (\alpha B)_0 - (\alpha B).$$

Similarly it may be shown that

$$(A\beta) = (A\beta)_0 - \delta, \qquad (\alpha\beta) = (\alpha\beta)_0 + \delta.$$

Therefore, quite generally we have

$$(AB) - (AB)_0 = (\alpha\beta) - (\alpha\beta)_0 = (A\beta)_0 - (A\beta) = (\alpha B)_0 - (\alpha B).$$

Supposing, for example,

N = 100    (A) = 60    (B) = 45

then

(AB)₀ = 27    (αB)₀ = 18    (Aβ)₀ = 33    (αβ)₀ = 22.

If, now, A and B are positively associated, and (AB) = say 35,
then (αB) = 45 − 35 = 10, (Aβ) = 60 − 35 = 25, (αβ) = 100 − 60 − 45
+ 35 = 30, and we have

35 − 27 = 30 − 22 = 18 − 10 = 33 − 25 = 8.

Similarly, if A and B be disassociated and (AB) = say 19, the student
will find that

(AB) = 19    (αB) = 26    (Aβ) = 41    (αβ) = 14
and 19 − 27 = 14 − 22 = 18 − 26 = 33 − 41 = −8.
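The equality of the four differences may be verified for the numerical case of this section. The Python sketch below uses an illustrative function name; it returns the excess of each second-order frequency over its independence value.

```python
def deviations(N, A, B, AB):
    """Actual minus independence value for each of the four second-order frequencies."""
    alpha, beta = N - A, N - B
    actual = {"AB": AB, "Aβ": A - AB, "αB": B - AB, "αβ": N - A - B + AB}
    independent = {
        "AB": A * B / N,
        "Aβ": A * beta / N,
        "αB": alpha * B / N,
        "αβ": alpha * beta / N,
    }
    return {cell: actual[cell] - independent[cell] for cell in actual}


print(deviations(100, 60, 45, 35))  # {'AB': 8.0, 'Aβ': -8.0, 'αB': -8.0, 'αβ': 8.0}
print(deviations(100, 60, 45, 19))  # {'AB': -8.0, 'Aβ': 8.0, 'αB': 8.0, 'αβ': -8.0}
```

The deviations are equal in magnitude throughout, with (AB) and (αβ) moving together and (Aβ) and (αB) moving the opposite way.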



12. The value of this common difference δ may be expressed
in a form that it is useful to note. We have by definition

$$\delta = (AB) - (AB)_0 = (AB) - \frac{(A)(B)}{N}.$$

Bring the terms on the right to a common denominator, and
express all the frequencies of the numerator in terms of those of
the second order; then we have

$$
\begin{aligned}
\delta &= \frac{1}{N}\Big\{(AB)\big[(AB) + (\alpha B) + (A\beta) + (\alpha\beta)\big] - \big[(AB) + (\alpha B)\big]\big[(AB) + (A\beta)\big]\Big\}\\
&= \frac{1}{N}\Big\{(AB)(\alpha\beta) - (\alpha B)(A\beta)\Big\}.
\end{aligned}
$$

That is to say, the common difference is equal to 1/Nth of the
difference of the "cross products" (AB)(αβ) and (αB)(Aβ); e.g.
taking the examples of § 11, we have

$$\delta = \frac{1}{100}\{35 \times 30 - 10 \times 25\} = 8 \quad\text{and}\quad \delta = \frac{1}{100}\{19 \times 14 - 26 \times 41\} = -8.$$

It is evident that the difference of the cross-products may be
very large if N be large, although δ is really very small. In
using the difference of the cross-products to test mentally the
sign of the association in a case where all the four second-order
frequencies are given, this should be remembered: the difference
should be compared with N, or it will be liable to suggest a higher
degree of association than actually exists.

Example viii. The following data were observed for hybrids of
Datura (W. Bateson and Miss Saunders, Report to the Evolution
Committee of the Royal Society, 1902):

Flowers violet, fruits prickly (AB) . . . 47
Flowers violet, fruits smooth (Aβ) . . . 12
Flowers white, fruits prickly (αB) . . . 21
Flowers white, fruits smooth (αβ) . . . . 3

Investigate the association between colour of flower and
character of fruit.

Since 3 × 47 = 141, 12 × 21 = 252, i.e. (AB)(αβ) < (αB)(Aβ),
there is clearly a negative association; 252 − 141 = 111, and at
first sight this considerable difference is apt to suggest a
considerable association. But numerically δ = 111/83 = 1.3 only, so
that in point of fact the association is small, so small that no
stress can be laid on it as indicating anything but a fluctuation of
sampling. Working out the percentages we have:

Percentage of violet-flowered plants with
prickly fruits (47 of 59) . . . . . . . . . 79.7 per cent.
Percentage of white-flowered plants with
prickly fruits (21 of 24) . . . . . . . . . 87.5 per cent.
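The cross-product expression for δ is easily put to work. The Python sketch below (the function name is an assumption) recovers the two values of § 11 and the small negative δ of the Datura data.

```python
def delta(AB, A_beta, alpha_B, alpha_beta):
    """Common difference δ = {(AB)(αβ) - (αB)(Aβ)} / N."""
    N = AB + A_beta + alpha_B + alpha_beta
    return (AB * alpha_beta - alpha_B * A_beta) / N


print(delta(35, 25, 10, 30))           # 8.0
print(delta(19, 41, 26, 14))           # -8.0
print(round(delta(47, 12, 21, 3), 1))  # -1.3 (Datura: cross-products differ by 111, N = 83)
```

Dividing by N is what keeps a large difference of cross-products from being mistaken for a large association.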



13. While the methods used in the preceding pages suffice for most practical purposes, it is often very convenient to measure the intensities of association in different cases by means of some formula or "coefficient," so devised as to be zero when the attributes are independent, +1 when they are completely associated, and -1 when they are completely disassociated, in the sense of § 6. If we use the term "complete association" in the wider sense there defined, we have, grouping the frequencies in a small table in a way that is sometimes convenient, the three cases of complete association:





(1)
             B        β        Total
A            (AB)     0        (A)
α            (αB)     (αβ)     (α)
Total        (B)      (β)      N

(2)
             B        β        Total
A            (AB)     (Aβ)     (A)
α            0        (αβ)     (α)
Total        (B)      (β)      N

(3)
             B        β        Total
A            (AB)     0        (A)
α            0        (αβ)     (α)
Total        (B)      (β)      N



In the first case all A's are B, and so (Aβ) = 0; in the second all B's are A, and so (αB) = 0; and in the third case we have (A) = (B) = (AB), so that all A's are B and also all B's are A. The three corresponding cases of complete disassociation are:






(1)
             B        β        Total
A            0        (Aβ)     (A)
α            (αB)     (αβ)     (α)
Total        (B)      (β)      N

(2)
             B        β        Total
A            (AB)     (Aβ)     (A)
α            (αB)     0        (α)
Total        (B)      (β)      N

(3)
             B        β        Total
A            0        (Aβ)     (A)
α            (αB)     0        (α)
Total        (B)      (β)      N



It is required to devise some formula which shall give the value +1 in the first three cases, -1 in the second three, and shall also be zero where the attributes are independent. Many such formulae may be devised, but perhaps the simplest possible is the expression

$$Q = \frac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)} = \frac{N\delta}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}$$

where δ is the symbol used in the two last sections for the difference (AB) - (AB)₀. It is evident that Q is zero when the attributes are independent, for then δ is zero: it takes the value +1 when there is complete association, for then the second term in both numerator and denominator of the first form of the expression is zero; similarly it is -1 where there is complete disassociation, for then the first term in both numerator and denominator is zero. Q may accordingly be termed a coefficient of association. As illustrations of the values it will take in certain cases, the association between deaf-mutism and imbecility, on the basis of the English census figures (Example vi.), is +0.91; between light eye colour in father and in son (Example vii.), +0.66; between colour of flower and prickliness of fruit in Datura (Example viii.), -0.28, an association which, however, as already stated, is probably of no practical significance and due to mere fluctuations of sampling.
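
As a minimal sketch of the coefficient just described, the function below computes Q from the four second-order frequencies. The function name `yule_q` and the three small check tables are illustrative assumptions; only the Datura figures come from the text.

```python
def yule_q(AB, Ab, aB, ab):
    """Coefficient of association Q from the four second-order frequencies.

    Ab, aB, ab stand for the classes (A beta), (alpha B), (alpha beta).
    """
    agree = AB * ab       # first cross-product
    disagree = Ab * aB    # second cross-product
    return (agree - disagree) / (agree + disagree)

# Datura data of Example viii.
q_datura = yule_q(47, 12, 21, 3)

# Hypothetical tables, made up only to exercise the limiting cases.
q_complete = yule_q(40, 0, 10, 50)      # (A beta) = 0: all A's are B
q_disassoc = yule_q(0, 30, 20, 50)      # (AB) = 0: no A's are B
q_independent = yule_q(20, 30, 20, 30)  # equal cross-products

print(round(q_datura, 2), q_complete, q_disassoc, q_independent)
```

Note that Q is undefined when both cross-products vanish, a case this sketch does not guard against.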

The coefficient is only mentioned here to direct the attention of the student to the possibility of forming such a measure of association, a measure which serves a similar purpose in the case of attributes to that served by certain other coefficients in the cases of manifold classification (cf. Chap. V.) and of variables (cf. Chap. IX., and the references to Chaps. X. and XVI.). For further illustrations of the use of this coefficient the reader is referred to reference (1) at the end of this chapter; and for a mode of deducing another coefficient, based on theorems in the theory of variables, which has come into more general use, to ref. (3). Reference should also be made to § 10 of Chap. XI.

14. In concluding this chapter, it may be well to repeat, for the sake of emphasis, that (cf. § 5) the mere fact of 80, 90, or 99 per cent. of A's being B implies nothing as to the association of A with B; in the absence of information, we can but assume that 80, 90, or 99 per cent. of α's may also be B. In order to apply the criterion of independence for two attributes A and B, it is necessary to have information concerning α's and β's as well as A's and B's, or concerning a universe that includes both α's and A's, β's and B's. Hence an investigation as to the causal relations of an attribute A must not be confined to A's, but must be extended to α's (unless, of course, the necessary information as to α's is already obtainable): no comparison is otherwise possible. It would be no use to obtain with great pains the result (cf. Example vi.) that 29.6 per thousand of deaf-mutes were imbecile unless we knew that the proportion of imbeciles in the whole population was only 1.5 per thousand; nor would it contribute anything to our knowledge of the heredity of deaf-mutism to find out the proportion of deaf-mutes amongst the offspring of deaf-mutes unless the proportions amongst the offspring of normal individuals were also investigated or known.



REFERENCES. 

(1) Yule, G. U., "On the Association of Attributes in Statistics," Phil. Trans. Roy. Soc., Series A, vol. cxciv., 1900, p. 257. (Deals fully with the theory of association; the association coefficient of § 13

(2) Yule, G. U., "Notes on the Theory of Association of Attributes in Statistics," Biometrika, vol. ii., 1903, p. 121. (Contains an abstract of the principal portions of (1) and other matter.)

(3) Pearson, Karl, "On the Correlation of Characters not Quantitatively Measurable," Phil. Trans. Roy. Soc., Series A, vol. cxcv., 1900, p. 1. (Deals with the problem of measurement of intensity of association from the standpoint of the theory of variables, giving a method which has since been largely used: only the advanced student will be able to follow the work.)

(4) Lipps, G. F., "Die Bestimmung der Abhängigkeit zwischen den Merkmalen eines Gegenstandes," Berichte d. math.-phys. Klasse d. kgl. Sächsischen Gesellschaft d. Wissenschaften; Leipzig, Feb. 1905. (Deals with the general theory of the dependence between two characters, however classified: the coefficient of association of § 13 is again suggested independently.)


EXERCISES.



1. At the census of England and Wales in 1901 there were (to the nearest 1000) 15,729,000 males and 16,799,000 females; 3497 males were returned as deaf-mutes from childhood, and 3072 females.

State proportions exhibiting the association between deaf-mutism from childhood and sex. How many of each sex for the same total number would have been deaf-mutes if there had been no association?

2. Show, as briefly as possible, whether A and B are independent, positively associated, or negatively associated in each of the following cases:



A' =6000 U) =2350 
[A) = 490 {AB)= 2B4 
{AB)- 266 (■«)= 768 



{B) =8100 (AB) = -im<i 
(«) = 670 (aB) = 880 
Ue)= 18 («3)= Ui 



3. (Figures derived from Darwin's Cross- and Self-fertilisation of Plants, cf. ref. 1, p. 294.) The table below gives the numbers of plants of certain species that were above or below the average height, stating separately those that were derived from cross-fertilised and from self-fertilised parentage. Investigate the association between height and cross-fertilisation of parentage, and draw attention to any special points you notice.



Spodes. 


Parentage Gross ftr- 
aiiaed. Heights 


Parentage Seli-fer- 1 
tilised. Height- 


Above 


Below 
Average. 


Above 
Average. 


Below 
Average. 


Ipomaa purpurea 
Petunia violacea 
Reseda lutea . 
Keseda odorata . 
Lobelia fulgena . . . 


es 

61 
26 
89 
17 


10 
1« 

7 
IS 
17 


18 
13 
11 
25 
12 


65 
64 
21 
SO 
23 



4. (Figures from same source as Example vii. p. 34, but material differently grouped; classes 7 and 8 of the memoir treated as "dark.") Investigate the association between darkness of eye-colour in father and son from the following data:



with dark eyes and son; 
vith not-d^rk eyes and 



with dark eyes 

not-dark ejes 
ions with dark eyes 



{AB). 
(AB). 
(aB) . 



Also tabnlate for comparison the Irsqi 

had there been no heredity, i.t. the vak 

5. (Figures from 



>t-darkeyes (o^) . 782 

cies that would have been obeerred 
ol(AB)„ (A3)o, et<!, {% II). 
I. ) Investigate the association between 



Husbands with not-light eyes and wi 



« with light eyes {oB} . 132 
noWighteyes (a3) . lip. 



Also tabulate for comparison the frequencies that would have been observed had there been strict independence between eye colour of husband and eye colour of wife, i.e. the values of (AB)₀, etc., as in question 4.

B. (Figures from tbeCeTUtis of England and Wala, ISSl, vol. ili. ; the data 
cannot be regarded as trustworthy.) The figures given below show the number of males in successive age groups, together with the number of the blind (A), of the mentally-deranged (B), and the blind mentally-deranged (AB). Trace the association between blindness and mental derangement from childhood to old age, tabulating the proportions of insane amongst the whole population and amongst the blind, and also the association coefficient. Give a short verbal statement of your results.












7. Show that if

(AB)₁, (αB)₁, (Aβ)₁, (αβ)₁,

(AB)₂, (αB)₂, (Aβ)₂, (αβ)₂,

be two aggregates corresponding to the same values of (A), (B), (α), and (β),

$$(AB)_1 - (AB)_2 = (\alpha B)_2 - (\alpha B)_1 = (A\beta)_2 - (A\beta)_1 = (\alpha\beta)_1 - (\alpha\beta)_2 .$$

8. Show that if

$$\delta = (AB) - (AB)_0,$$

$$(AB)^2 + (\alpha\beta)^2 - (\alpha B)^2 - (A\beta)^2 = \big[(A) - (\alpha)\big]\big[(B) - (\beta)\big] + 2N\delta .$$



PARTIAL ASSOCIATION.

1-2. Uncertainty in interpretation of an observed association. 3-5. Source of the ambiguity: partial associations. 6-8. Illusory association due to the association of each of two attributes with a third. 9. Estimation of the partial associations from the frequencies of the second order. 10-12. The total number of associations for a given number of attributes. 13-14. The case of complete independence.

1. If we find that in any given case

(AB) > or < (A)(B)/N,

all that is known is that there is a relation of some sort or kind between A and B. The result by itself cannot tell us whether the relation is direct, whether possibly it is only due to "fluctuations of sampling" (cf. Chap. III. §§ 7-8), or whether it is of any other particular kind that we may happen to have in our minds at the moment. Any interpretation of the meaning of the association is necessarily hypothetical, and the number of possible alternative hypotheses is in general considerable.

2. The commonest of all forms of alternative hypothesis is of this kind: it is argued that the relation between the two attributes A and B is not direct, but due, in some way, to the association of A with C and of B with C. An illustration or two will make the matter clearer:

(1) An association is observed between "vaccination" and "exemption from attack by small-pox," i.e. more of the vaccinated than of the unvaccinated are exempt from attack. It is argued that this does not imply a protective effect of vaccination, but is wholly due to the fact that most of the unvaccinated are drawn from the lowest classes, living in very unhygienic conditions. Denoting vaccination by A, exemption from attack by B, hygienic conditions by C, the argument is that the observed association between A and B is due to the associations of both with C.

(2) It is observed, at a general election, that a greater proportion of the candidates who spent more money than their opponents won their elections than of those who spent less. It is argued that this does not mean an influence of expenditure on the result of elections, but is due to the fact that Conservative principles generally carried the day, and that the Conservatives generally spent more than the Liberals. Denoting winning by A, spending more than the opponent by B, and Conservative by C, the argument is the same as the above (cf. Question 9 at the end of the chapter).

(3) An association is observed between the presence of some attribute in the father and its presence in the son; and also between the presence of the attribute in the grandfather and its presence in the grandson. Denoting the presence of the attribute in son, father, and grandfather by A, B, and C, the question arises whether the association between A and C may not be due solely to the associations between A and B, B and C, respectively.

3. The ambiguity in such cases evidently arises from the fact that the universe of observation, in each case, contains not merely objects possessing the third attribute alone, or objects not possessing it, but both.

If the universe were restricted to either class alone the given ambiguity would not arise, though of course others might remain.

Thus, in the first illustration, if the statistics of vaccination and attack were drawn from one narrow section of the population living under approximately the same hygienic conditions, and an association were still observed between vaccination and exemption from attack, the supposed argument would be refuted. The fact would prove that the association between vaccination and exemption could not be wholly due to the association of both with hygienic conditions.

Again, in the second illustration, if we confine our attention to the "universe" of Conservatives (instead of dealing with candidates of both parties together), and compare the percentages of Conservatives winning elections when they spend more than their opponents and when they spend less, we shall avoid the possible fallacy. If the percentage is greater in the former case than in the latter, it cannot be for the reasons suggested in § 2.

The biological case of the third illustration should be similarly treated. If the association between A and C be observed for those cases in which all the parents, say, possess the attribute, or else all do not, and it is still sensible, then the association first observed between A and C for the whole universe cannot have been due solely to the observed associations between A and B, B and C.

4. The associations observed between the attributes A and B in the universe of C's and the universe of γ's may be termed partial associations, to distinguish them from the total associations observed between A and B in the universe at large. In terms of the definition of § 5 of Chap. III., A and B will be said to be positively associated in the universe of C's (cf. § 4 of Chap. II.) when

$$(ABC) > \frac{(AC)(BC)}{(C)} \qquad (1)$$

and negatively associated in the converse case.

As in the simpler case, the association is most simply tested by a comparison of percentages or proportions (§ 9, Chap. III.), although for some purposes the "coefficient of association" may be useful. Confining our attention to the more fundamental method, if A and B are positively associated within the universe of C's, we must have, to quote only the four most convenient comparisons (cf. (4) (a)-(d), Chap. III. p. 31),

$$\frac{(ABC)}{(BC)} > \frac{(AC)}{(C)} \quad (a) \qquad\qquad \frac{(ABC)}{(AC)} > \frac{(BC)}{(C)} \quad (b)$$

$$\frac{(ABC)}{(BC)} > \frac{(A\beta C)}{(\beta C)} \quad (c) \qquad\qquad \frac{(ABC)}{(AC)} > \frac{(\alpha BC)}{(\alpha C)} \quad (d) \qquad (2)$$

These inequalities may easily be rewritten for any other case by making the proper substitutions in the symbols; thus to obtain the inequalities for testing the association between A and C in the universe of B's, B must be written for C, β for γ, and vice versa, throughout; it being remembered that the order of the letters in the class-symbol is immaterial. The remarks of § 10, Chap. III., as to the choice of the comparison to be used, apply of course equally to the present case.

5. Though we shall confine ourselves in the present work to the detailed discussion of the case of three attributes, it should be noticed that precisely similar conceptions and formulae to the above apply in the general case where more than three attributes have been noted, or where the relations of more than three have to be taken into account. If, when it is observed that A and B are still associated within the universe of C's, it is argued that this is due to the association of both A and B with D, the argument may be tested by still further limiting the field of observation to the universe CD. If

$$(ABCD) > \frac{(ACD)(BCD)}{(CD)},$$

A and B are positively associated within the universe of CD's, and the association cannot be wholly ascribed to the presence and absence of D as suggested, nor to the presence and absence of C and D conjointly. If it be then argued that the presence and absence of E is the source of association, the process may be repeated as before, the association of A and B being tested for the universe CDE, and so on as far as practicable.

Partial associations thus form the basis of discussion for any case, however complicated. The two following examples will serve as illustrations for the case of three attributes.

Example i. (Material from ref. 5 of Chap. I.)

The following are the proportions per 10,000 of boys observed with certain classes of defects, amongst a number of school children. (A) denotes the number with development defects, (B) with nerve-signs, (D) the number of the "dull."

N      10,000        (AB)     338
(A)       877        (AD)     338
(B)     1,086        (BD)     455
(D)       789        (ABD)    153

The Report from which the figures are drawn concludes that "the connecting link between defects of body and mental dulness is the coincident defect of brain which may be known by observation of abnormal nerve-signs." Discuss this conclusion.

The phrase "connecting link" is a little vague, but it may mean that the mental defects indicated by nerve-signs B may give rise to development-defects A, and also to mental dulness D; A and D being thus common effects of the same cause B (or another attribute necessarily indicated by B), and not directly influencing each other. The case is thus similar to that of the first illustration of § 2 (liability to small-pox and to non-vaccination being held to be common effects of the same circumstances), and may be similarly treated by investigation of the partial associations between A and D for the universes B and β. As the ratios (A)/N, (B)/N, (D)/N are small, comparisons of the form (4) (a) or (b) of Chap. III. (p. 31), or (2) (a) (b) above, may very well be used (cf. the remarks in § 10 of the same chapter, pp. 31-2).

The following figures illustrate, then, the association between A and D for the whole universe, the B-universe and the β-universe:

For the entire material:
Proportion of the dull = (D)/N = 789/10,000 = 7.9 per cent.
Proportion of the defectively developed who were dull = (AD)/(A) = 338/877 = 38.5 per cent.

For those exhibiting nerve-signs:
Proportion of the dull = (BD)/(B) = 455/1,086 = 41.9 per cent.
Proportion of the defectively developed who were dull = (ABD)/(AB) = 153/338 = 45.3 per cent.

For those not exhibiting nerve-signs:
Proportion of the dull = (βD)/(β) = 334/8,914 = 3.7 per cent.
Proportion of the defectively developed who were dull = (AβD)/(Aβ) = 185/539 = 34.3 per cent.



The results are extremely striking; the association between A and D is very high indeed both for the material as a whole (the universe at large) and for those not exhibiting nerve-signs (the β-universe), but it is very small for those who do exhibit nerve-signs (the B-universe).

This result does not appear to be in accord with the conclusion of the Report, as we have interpreted it, for the association between A and D in the β-universe should in that case have been very low instead of very high.
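
The comparison of Example i can be reproduced from the eight given frequencies alone, since the β-classes follow by subtraction. The sketch below is only an illustration; the variable and key names are assumptions, not part of the Report or the text.

```python
# Frequencies per 10,000 boys from Example i
# (A: development defects, B: nerve-signs, D: dull).
N, A, B, D = 10_000, 877, 1_086, 789
AB, AD, BD, ABD = 338, 338, 455, 153

# Classes without nerve-signs (the beta-universe), obtained by subtraction.
beta = N - B              # (beta)
beta_D = D - BD           # (beta D)
A_beta = A - AB           # (A beta)
A_beta_D = AD - ABD       # (A beta D)

def pct(part, whole):
    """Percentage of `whole` falling in `part`, to one decimal place."""
    return round(100 * part / whole, 1)

proportions = {
    "whole universe": (pct(D, N), pct(AD, A)),
    "B-universe": (pct(BD, B), pct(ABD, AB)),
    "beta-universe": (pct(beta_D, beta), pct(A_beta_D, A_beta)),
}

for universe, (dull_all, dull_among_A) in proportions.items():
    print(universe, dull_all, dull_among_A)
```

In each pair the first figure is the proportion of the dull in the universe, the second the proportion of the dull among the defectively developed; a wide gap between the two indicates a high association between A and D.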

Example ii. Eye-colour of grandparent, parent and child. (Material from Sir Francis Galton's Natural Inheritance (1889), table 20, p. 216. The table only gives particulars for 78 large families with not less than 6 brothers or sisters, so that the material is hardly entirely representative, but serves as a good illustration of the method.) The original data are treated as in Example vii. of the last chapter (p. 34). Denoting a light-eyed child by A, parent by B, grandparent by C, every possible line of descent is taken into account. Thus, taking the following two lines of the table,

       Children                  Parents                 Grandparents
    A           α             B           β             C           γ
Light-eyed  Not-light-eyed  Light-eyed  Not-light-eyed  Light-eyed  Not-light-eyed
    4           5             1           1             1           3
    3           4             1           1             4           0

the first would give 4 × 1 × 1 = 4 to the class ABC, 4 × 1 × 3 = 12 to the class ABγ, 4 to AβC, 12 to Aβγ, 5 to αBC, 15 to αBγ, 5 to αβC, and 15 to αβγ; the second would give 3 × 1 × 4 = 12 to the class ABC, 12 to AβC, 16 to αBC, 16 to αβC, and none to the remainder. The class-frequencies so derived from the whole table are,

(ABC)  1928        (αBC)  303
(ABγ)   596        (αBγ)  225
(AβC)   552        (αβC)  395
(Aβγ)   508        (αβγ)  501

The following comparisons indicate the association between grandparents and parents, parents and children, and grandparents and grandchildren, respectively:

Grandparents and Parents.

Proportion of light-eyed amongst the children of light-eyed grandparents = (BC)/(C) = 2231/3178 = 70.2 per cent.
Proportion of light-eyed amongst the children of not-light-eyed grandparents = (Bγ)/(γ) = 821/1830 = 44.9 per cent.

Parents and Children.

Proportion of light-eyed amongst the children of light-eyed parents = (AB)/(B) = 2524/3052 = 82.7 per cent.
Proportion of light-eyed amongst the children of not-light-eyed parents = (Aβ)/(β) = 1060/1956 = 54.2 per cent.

In both the above cases we are really dealing with the association between parent and offspring, and consequently the intensity of association is, as might be expected, approximately the same; in the next case it is naturally lower:

Grandparents and Grandchildren.

Proportion of light-eyed amongst the grandchildren of light-eyed grandparents = (AC)/(C) = 2480/3178 = 78.0 per cent.
Proportion of light-eyed amongst the grandchildren of not-light-eyed grandparents = (Aγ)/(γ) = 1104/1830 = 60.3 per cent.



We proceed now to test the partial associations between grandparents and grandchildren, as distinct from the total associations given above, in order to throw light on the real nature of the inheritance. There are two such partial associations to be tested: (1) where the parents are light-eyed, (2) where they are not-light-eyed. The following are the comparisons:

Grandparents and Grandchildren: Parents light-eyed.

Proportion of light-eyed amongst the grandchildren of light-eyed grandparents = (ABC)/(BC) = 1928/2231 = 86.4 per cent.
Proportion of light-eyed amongst the grandchildren of not-light-eyed grandparents = (ABγ)/(Bγ) = 596/821 = 72.6 per cent.

Grandparents and Grandchildren: Parents not-light-eyed.

Proportion of light-eyed amongst the grandchildren of light-eyed grandparents = (AβC)/(βC) = 552/947 = 58.3 per cent.
Proportion of light-eyed amongst the grandchildren of not-light-eyed grandparents = (Aβγ)/(βγ) = 508/1009 = 50.3 per cent.

In both cases the partial association is quite well-marked and positive; the total association between grandparents and grandchildren cannot, then, be due wholly to the total associations between grandparents and parents, parents and children, respectively. There is an ancestral heredity, as it is termed, as well as a parental heredity.

We need not discuss the partial association between children and parents, as it is comparatively of little consequence. It may be noted, however, as regards the above results, that the most important feature may be brought out by stating three ratios only. If A and B are positively associated, (AB)/(B) > (A)/N. If A and C are positively associated in the universe of B's, (ABC)/(BC) > (AB)/(B). Hence (A)/N, (AB)/(B), and (ABC)/(BC) form an ascending series. Thus we have from the given data:

Proportion of light-eyed amongst children in general = (A)/N = 71.6 per cent.
Proportion of light-eyed amongst the children of light-eyed parents = (AB)/(B) = 82.7 per cent.
Proportion of light-eyed amongst the children of light-eyed parents and grandparents = (ABC)/(BC) = 86.4 per cent.

If the great-grandparents, etc., etc., were also known, the series might be continued, giving (ABCD)/(BCD), (ABCDE)/(BCDE), and so forth. The series would probably ascend continuously though with smaller intervals, A and D being positively associated in the universe of BC's, A and E in the universe of BCD's, etc.
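
The total and partial comparisons of Example ii all follow from the eight third-order frequencies, and a short sketch makes the bookkeeping explicit. The helper names are illustrative. Two of the frequencies used here, (AβC) = 552 and (αβγ) = 501, are assumptions: they are the values consistent with the totals (C) = 3178 and (AC) = 2480 and with the proportions 58.3 and 60.3 per cent. quoted in the example, and should be checked against a clean copy of the table.

```python
# Third-order class frequencies of Example ii. Capital letters mean light-eyed
# (A child, B parent, C grandparent); lower-case a, b, c stand for the Greek
# letters, i.e. not-light-eyed.
# ASSUMPTION: AbC = 552 and abc = 501 are inferred from the quoted totals.
f = {"ABC": 1928, "ABc": 596, "AbC": 552, "Abc": 508,
     "aBC": 303, "aBc": 225, "abC": 395, "abc": 501}

def total(pattern):
    """Sum the classes matching a pattern such as 'AB?' ('?' matches anything)."""
    return sum(n for cls, n in f.items()
               if all(p in ("?", c) for p, c in zip(pattern, cls)))

def pct(num, den):
    """Percentage of class pattern `den` that also falls in pattern `num`."""
    return round(100 * total(num) / total(den), 1)

N = total("???")

# Ascending series (A)/N, (AB)/(B), (ABC)/(BC).
series = [pct("A??", "???"), pct("AB?", "?B?"), pct("ABC", "?BC")]

# Partial associations of grandparent and grandchild within each parental class.
partial_light_parents = (pct("ABC", "?BC"), pct("ABc", "?Bc"))
partial_dark_parents = (pct("AbC", "?bC"), pct("Abc", "?bc"))

print(N, series, partial_light_parents, partial_dark_parents)
```

In both partial universes the first percentage exceeds the second, which is the positive partial association on which the argument for an ancestral heredity rests.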

6. The above examples will serve to illustrate the practical application of partial associations to concrete cases. The general nature of the fallacies involved in interpreting associations between two attributes as if they were necessarily due to the most obvious form of direct causation is more clearly exhibited by the following theorem:

If A and B are independent within the universe of C's and also within the universe of γ's, they will nevertheless be associated within the universe at large, unless C is independent of either A or B or both.

The two data give:

$$(ABC) = \frac{(AC)(BC)}{(C)}, \qquad (AB\gamma) = \frac{(A\gamma)(B\gamma)}{(\gamma)} = \frac{\big[(A) - (AC)\big]\big[(B) - (BC)\big]}{N - (C)} \qquad (3)$$

Adding them together we have:

$$(AB) = \frac{N(AC)(BC) - (A)(C)(BC) - (B)(C)(AC) + (A)(B)(C)}{(C)\big[N - (C)\big]}$$

Write, as in § 11 of Chap. III. (p. 35):

$$(AB)_0 = \frac{(A)(B)}{N}, \qquad (AC)_0 = \frac{(A)(C)}{N}, \qquad (BC)_0 = \frac{(B)(C)}{N}.$$

Subtracting (AB)₀ from both sides of the above equation, we have, after a little simplification:

$$(AB) - (AB)_0 = \frac{N}{(C)\big[N - (C)\big]}\big[(AC) - (AC)_0\big]\big[(BC) - (BC)_0\big] \qquad (4)$$

This proves the theorem; for the right-hand side will not be zero unless either (AC) = (AC)₀ or (BC) = (BC)₀.
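
Equation (4) can be checked numerically. The sketch below builds a small universe in which A and B are exactly independent within the C's and within the γ's, and compares the two sides of (4) using exact fractions. All the frequencies are made up for illustration and are not taken from the text.

```python
from fractions import Fraction

# Hypothetical universe: A and B independent inside C and inside gamma.
C, AC, BC = 100, 60, 50
ABC = Fraction(AC * BC, C)        # independence within the C's
g, Ag, Bg = 200, 40, 50           # g stands for gamma
ABg = Fraction(Ag * Bg, g)        # independence within the gamma's

# Totals for the universe at large.
N = C + g
A, B = AC + Ag, BC + Bg
AB = ABC + ABg

def expected(x, y):
    """Frequency (XY)0 expected under independence in the whole universe."""
    return Fraction(x * y, N)

lhs = AB - expected(A, B)
rhs = Fraction(N, C * (N - C)) * (AC - expected(A, C)) * (BC - expected(B, C))

print(lhs, rhs)
```

Both sides come out to the same non-zero value, so A and B are associated in the pooled universe although independent in each part; making either (AC) or (BC) equal to its independence value reduces both sides to zero.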

7. The result indicates that, while no degree of heterogeneity in the universe can influence the association between A and B if all other attributes are independent of either A or B or both, an illusory or misleading association may arise in any case where there exists in the given universe a third attribute C with which both A and B are associated (positively or negatively). If both associations are of the same sign, the resulting illusory association between A and B will be positive; if of opposite sign, negative. The three illustrations of § 2 are all of the first kind. In (1) it is argued that the positive associations between vaccination and hygienic conditions, exemption from attack and hygienic conditions, give rise to an illusory positive association between vaccination and exemption from attack. In (2) it is argued that the positive associations between conservative and winning, conservative and spending more, give rise to an illusory positive association between winning and spending more. In (3) the question is raised whether the positive association between grandparent and grandchild may not be due solely to the positive associations between grandparent and parent, parent and child.

Misleading associations of this kind may easily arise through the mingling of records, e.g. respecting the two sexes, which a careful worker would keep distinct.

Take the following case, for example. Suppose there have been 200 patients in a hospital, 100 males and 100 females, suffering from some disease. Suppose, further, that the death-rate for males (the case mortality) has been 30 per cent., for females 60 per cent. A new treatment is tried on 80 per cent. of the males and 40 per cent. of the females, and the results published without distinction of sex. The three attributes, with the relations of which we are here concerned, are death, treatment and male sex. The data show that more males were treated than females, and more females died than males; therefore the first attribute is associated negatively, the second positively, with the third. It follows that there will be an illusory negative association between the first two, death and treatment. If the treatment were completely inefficient we would, in fact, have the following results:

                              Males.   Females.   Total.
Treated and died . . . . .      24        24        48
   "    and did not die  .      56        16        72
Not treated and died . . .       6        36        42
   "    and did not die  .      14        24        38

i.e. of the treated, only 48/120 = 40 per cent. died, while of those not treated 42/80 = 52.5 per cent. died. If this result were stated without any reference to the fact of the mixture of the sexes, to the different proportions of the two that were treated and to the different death-rates under normal treatment, then some value in the new treatment would appear to be suggested. To make a fair return, either the results for the two sexes should be stated separately, or the same proportion of the two sexes must receive the experimental treatment. Further, care would have to be taken in such a case to see that there was no selection (perhaps unconscious) of the less severe cases for treatment, thus introducing another source of fallacy (death positively associated with severity, treatment negatively associated with severity, giving rise to illusory negative association between treatment and death).
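
The hospital illustration can be worked through mechanically. The following sketch assumes, as the text does, that the treatment has no effect at all, so that the sex-specific death-rate applies equally to treated and untreated; the dictionary and variable names are illustrative.

```python
# Inputs of the hospital illustration, as whole percentages.
patients = {"male": 100, "female": 100}
death_pct = {"male": 30, "female": 60}      # case mortality by sex
treated_pct = {"male": 80, "female": 40}    # share receiving the new treatment

table = {}
for sex, n in patients.items():
    treated = n * treated_pct[sex] // 100
    untreated = n - treated
    # Completely inefficient treatment: same death-rate in both groups.
    table[sex] = {
        "treated_died": treated * death_pct[sex] // 100,
        "treated_total": treated,
        "untreated_died": untreated * death_pct[sex] // 100,
        "untreated_total": untreated,
    }

def pooled(key):
    """Total of one cell over both sexes, i.e. the record with sexes mixed."""
    return sum(cells[key] for cells in table.values())

rate_treated = 100 * pooled("treated_died") / pooled("treated_total")
rate_untreated = 100 * pooled("untreated_died") / pooled("untreated_total")

print(rate_treated, rate_untreated)
```

Within each sex the treated and untreated death-rates are identical by construction; the apparent advantage of treatment appears only when the two sexes are pooled.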

A misleading association between the characters of parent and offspring might similarly be created if the records for male-male and female-female lines of descent were mixed. Thus suppose 50 per cent. of males and 10 per cent. of females exhibit some attribute for which there is no association in either line, then we would have for each line and for a mixed record of equal numbers:

                                                  Male line.     Female line.    Mixed record.
Parents with attribute and children with . . .    25 per cent.    1 per cent.    13 per cent.
Parents with attribute and children without  .    25     "        9     "        17     "
Parents without attribute and children with  .    25     "        9     "        17     "
Parents without attribute and children without    25     "       81     "        53     "

Here 13/30 = 43 per cent. of the offspring of parents with the attribute possess the attribute themselves, but only 17/70 = 24 per cent. of the offspring of parents without the attribute. The association between attribute in parent and attribute in offspring is, however, due solely to the association of both with male sex. The student will see that if records for male-female and female-male lines were mixed, the illusory association would be negative, and that if all four lines were combined there would be no illusory association at all.
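
The mixed record can be derived from the two assumed rates alone. The sketch below uses exact fractions and treats parent and offspring as independent within each line, as the text supposes; the names are illustrative.

```python
from fractions import Fraction

# Proportion exhibiting the attribute in each sex.
rate = {"male": Fraction(1, 2), "female": Fraction(1, 10)}

def line(p):
    """Joint proportions for one line of descent with no association."""
    q = 1 - p
    return {"both": p * p, "parent_only": p * q,
            "child_only": q * p, "neither": q * q}

male_line, female_line = line(rate["male"]), line(rate["female"])

# Mixed record of equal numbers from the two lines.
mixed = {k: (male_line[k] + female_line[k]) / 2 for k in male_line}

# Proportion of offspring with the attribute, by parental class.
with_parent = mixed["both"] / (mixed["both"] + mixed["parent_only"])
without_parent = mixed["child_only"] / (mixed["child_only"] + mixed["neither"])

print({k: float(v) for k, v in mixed.items()})
print(round(float(100 * with_parent)), round(float(100 * without_parent)))
```

The two lines taken separately show no association, yet the mixed record gives 13/30 against 17/70, the illusory positive association described above.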

8. Illusory associations may also arise in a different way through the personality of the observer or observers. If the observer's attention fluctuates, he may be more likely to notice the presence of A when he notices the presence of B, and vice versa; in such a case A and B (so far as the record goes) will both be associated with the observer's attention C, and consequently an illusory association will be created. Again, if the attributes are not well defined, one observer may be more generous than another in deciding when to record the presence of A and also the presence of B, and even one observer may fluctuate in the generosity of his marking. In this case the recording of A and the recording of B will both be associated with the generosity of the observer in recording their presence, C, and an illusory association between A and B will consequently arise, as before.

9. It is important to notice that, though we cannot actually determine the partial associations unless the third-order frequency (ABC) is given, we can make some conjecture as to their sign from the values of the second-order frequencies.

Suppose, for instance, that

$$(ABC) = \frac{(AC)(BC)}{(C)} + \delta_1, \qquad (AB\gamma) = \frac{(A\gamma)(B\gamma)}{(\gamma)} + \delta_2 \qquad (5)$$

so that δ₁ and δ₂ are positive or negative according as A and B are positively or negatively associated in the universes of C and γ respectively. Then we have by addition:

$$(AB) = \frac{(AC)(BC)}{(C)} + \frac{(A\gamma)(B\gamma)}{(\gamma)} + \delta_1 + \delta_2 \qquad (6)$$

Hence if the value of (AB) exceed the value given by the first two terms (i.e. if δ₁ + δ₂ be positive), A and B must be positively associated either in the universe of C's, the universe of γ's, or both. If, on the other hand, (AB) fall short of the value given by the first two terms, A and B must be negatively associated in the universe of C's, the universe of γ's, or both. Finally, if (AB) be equal to the value of the first two terms, A and B must be positively associated in the one partial universe and negatively in the other, or else independent in both.

The expression (6) may often be used in the following form, obtained by dividing through by, say, (B):

$$\frac{(AB)}{(B)} = \frac{(AC)}{(C)}\cdot\frac{(BC)}{(B)} + \frac{(A\gamma)}{(\gamma)}\cdot\frac{(B\gamma)}{(B)} + \frac{\delta_1 + \delta_2}{(B)} \qquad (7)$$

In using this expression we make use solely of proportions or percentages, and judge of the sign of the partial associations between A and B accordingly. A concrete case, as in Example iii. below, is perhaps clearer than the general formula.

Example iii. (Figures compiled from Supplement to the Fifty-fifth
Annual Report of the Registrar-General [C. 8503], 1897.)
The following are the death-rates per thousand per annum, and the
proportions per thousand over 65 years of age, of occupied males in
general, farmers, textile workers, and glass workers (over 15 years
of age in each case) during the decade 1891-1900 in England and Wales.

                                     Death-rate      Proportion
                                     per thousand.   per thousand
                                                     over 65.
    Occupied males over 15 .            15.8             46
    Farmers, males over 15 .            19.6            132
    Textile workers, males over 15      15.9             34
    Glass workers, males over 15 .      16.6             16

Would farming, textile working, and glass working seem to be
relatively healthy or unhealthy occupations, given that the
death-rates among occupied males from 15-65 and over 65 years of age
are 11.5 and 102.3 per thousand respectively?

If A denote death, B the given occupation, C old age, we have
to apply the principle of equation (7). Calculate what would be
the death-rate for each occupation on the supposition that the
death-rates for occupied males in general (11.5, 102.3) apply to
each of its separate age-groups (under 65, over 65), and see
whether the total death-rate so calculated exceeds or falls short
of the actual death-rate. If it exceeds the actual rate, the
occupation must on the whole be healthy; if it falls short,
unhealthy. Thus we have the following calculated death-rates:

    Farmers . . . .   11.5 x .868 + 102.3 x .132 = 23.5.
    Textile workers   11.5 x .966 + 102.3 x .034 = 14.6.
    Glass workers .   11.5 x .984 + 102.3 x .016 = 13.0.

The calculated rate for farmers largely exceeds the actual rate;
farming, then, must on the whole, as one would expect, be
a healthy occupation. The death-rate for either young farmers
or old farmers, or both, must be less than for occupied males in
general (the last is actually the case); the high death-rate
observed is due solely to the large proportion of the aged. Textile
working, on the other hand, appears to be unhealthy (14.6 < 15.9),
and glass working still more so (13.0 < 16.6); the actual low total
death-rates are due merely to low proportions of the aged.
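
The arithmetic of Example iii. can be checked with a short sketch such as the following. It is only an illustration: the names `occupations`, `expected_rate` and `results` are invented here, and the figures are those quoted in the example.

```python
# Age-specific death-rates per thousand for occupied males in general
# (figures quoted in Example iii.).
RATE_UNDER_65 = 11.5
RATE_OVER_65 = 102.3

# occupation: (actual death-rate per thousand, proportion over 65 per thousand)
occupations = {
    "Farmers": (19.6, 132),
    "Textile workers": (15.9, 34),
    "Glass workers": (16.6, 16),
}


def expected_rate(over_65_per_thousand):
    """Death-rate the occupation would show if the general rates applied
    to each of its two age-groups."""
    p_old = over_65_per_thousand / 1000
    return RATE_UNDER_65 * (1 - p_old) + RATE_OVER_65 * p_old


results = {}
for name, (actual, old) in occupations.items():
    calculated = expected_rate(old)
    # Calculated rate above the actual rate: the occupation is healthy.
    verdict = "healthy" if calculated > actual else "unhealthy"
    results[name] = (round(calculated, 1), verdict)
    print(f"{name}: calculated {calculated:.1f}, actual {actual:.1f}, {verdict}")
```

The comparison is deliberately the crude two-group one of the text; a finer age classification would only change `expected_rate`.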

It is evident that age-distributions vary so largely from one
occupation to another that total death-rates are liable to be very
misleading; so misleading, in fact, that they are not tabulated at all
by the Registrar-General; only death-rates for narrow limits of age
(5 or 10 year age-classes) are worked out. Similar fallacies are
liable to occur in comparisons of local death-rates, owing to
variations not only in the relative proportions of the old, but also
in the relative proportions of the two sexes.

It is hardly necessary to observe that as age is a variable quantity,
the above procedure for calculating the comparative death-rates
is extremely rough. The death-rate of those engaged in any
occupation depends not only on the mere proportions over and under
65, but on the relative numbers at every single year of age. The
simpler procedure brings out, however, better than a more complex
one, the nature of the fallacy involved in assuming that crude
death-rates are measures of healthiness. [See also Chap. XI. §§ 17-19.]

Example iv. Eye-colour in grandparent, parent and child.
(The figures are those of Example ii.)

A, light-eyed child; B, light-eyed parent; C, light-eyed
grandparent.

    N   = 5008      (AB) = 2524
    (A) = 3584      (AC) = 2480
    (B) = 3052      (BC) = 2231
    (C) = 3178

Given only the above data, investigate whether there is probably
a partial association between child and grandparent.
If there were no partial association we would have

$$
(AC) = \frac{(AB)(BC)}{(B)} + \frac{(A\beta)(\beta C)}{(\beta)}
     = \frac{2524 \times 2231}{3052} + \frac{1060 \times 947}{1956}
     = 1845.0 + 513.2 = 2358.2
$$

Actually (AC) = 2480; there must, then, be partial association
either in the B-universe, the β-universe, or both. In the absence
of any reason to the contrary, it would be natural to suppose there
is a partial association in both; i.e. that there is a partial
association with the grandparent whether the line of descent
passes through "light-eyed" or "not-light-eyed" parents, but this
could not be proved without a knowledge of the class-frequency
(ABC).
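
As a check on Example iv., the following sketch recomputes the value (AC) would take in the absence of partial association. The variable names are illustrative only; the frequencies are those given in the example, and the β-frequencies are obtained by subtraction.

```python
# Frequencies from Example iv.
N, A, B, C = 5008, 3584, 3052, 3178
AB, AC, BC = 2524, 2480, 2231

# Frequencies involving beta (not-B), obtained by subtraction.
beta = N - B        # (β)
A_beta = A - AB     # (Aβ)
beta_C = C - BC     # (βC)

# Value of (AC) if A and C were independent in both the B-universe
# and the β-universe.
AC_expected = AB * BC / B + A_beta * beta_C / beta
excess = AC - AC_expected

print(round(AC_expected, 1), round(excess, 1))
```

A positive `excess` indicates positive partial association in at least one of the two universes, but not in which.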

10. The total possible number of associations to be derived from
n attributes grows so rapidly with the value of n that the
evaluation of them all for any case in which n is greater than four
becomes almost unmanageable. For three attributes there are 9
possible associations: three totals, three partials in positive
universes, and three partials in negative universes. For four
attributes, the number of possible associations rises to 54,
for there are 6 pairs to be formed from four attributes, and
we can find 9 associations for each pair (1 total, 4 partials
with the universe specified by one attribute, and 4 partials
with the universe specified by two). For five attributes the
student will find that there are no less than 270, and for six
attributes 1215 associations.
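
The counts quoted above follow from a simple rule: choose the pair of attributes, and let each remaining attribute be unspecified, positive, or negative. A minimal sketch (the function name `total_associations` is an invention for illustration):

```python
from math import comb


def total_associations(n):
    """Count every total and partial association among n attributes."""
    pairs = comb(n, 2)         # pairs of attributes to be compared
    universes = 3 ** (n - 2)   # each other attribute: unspecified, +, or -
    return pairs * universes


counts = {n: total_associations(n) for n in range(3, 7)}
print(counts)
```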

As suggested by Examples i. and ii. above, however, it is not
necessary in any actual case to investigate all the associations
that are theoretically possible; the nature of the problem indicates
those that are required.

In Example i., for instance, the total and partial associations
between A and D were alone investigated; the associations between
A and B, B and D were not essential for answering the question
that was asked. In Example ii., again, the three total associations
and the partial association between A and C were worked out,
but the partial associations between A and B, B and C were
omitted as unnecessary. Practical considerations of this kind will
always lessen the amount of necessary labour.




11. It might appear, at first sight, that theoretical considerations
would enable us to lessen it still further. As we saw in
Chapter I., all class-frequencies can be expressed in terms of those
of the positive classes, of which there are $2^n$ in the case of n
attributes. For given values of the n + 1 frequencies N, (A), (B),
(C), ... of order lower than the second, assigned values of the
positive class-frequencies of the second and higher orders must
therefore correspond to determinate values of all the possible
associations. But the number of these positive class-frequencies
of the second and higher orders is only $2^n - n - 1$; therefore the
number of algebraically independent associations that can be
derived from n attributes is only $2^n - n - 1$. For successive
values of n this gives

    n      2^n - n - 1
    2           1
    3           4
    4          11
    5          26
    6          57

Hence if we give data, in any form, that determine four
associations in the case of three attributes, eleven in the case of
four attributes, and so on, in addition to N and the class-frequencies
of the first order, we have done all that is theoretically necessary.
The remaining associations can be deduced.
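
The count of independent associations can be verified by enumerating the positive classes of the second and higher orders directly. This is a sketch only; `independent_associations` is a name chosen here for illustration.

```python
from math import comb


def independent_associations(n):
    """Number of positive class-frequencies of order 2, 3, ..., n."""
    return sum(comb(n, order) for order in range(2, n + 1))


table = {n: independent_associations(n) for n in range(2, 7)}
print(table)

# The direct count agrees with the closed form 2^n - n - 1.
closed_form_agrees = all(table[n] == 2 ** n - n - 1 for n in table)
```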

12. Practically, however, the mere fact that they can be deduced
is of little help unless such deduction can be effected simply,
indeed almost directly, by mere mental arithmetic, and
this is not the case. The relations that exist between the ratios
or differences, such as $(AB) - (AB)_0$, that indicate the associations
are, in fact, so complex that an unknown association cannot be
determined from those that are given without more or less lengthy
work; it is not possible to infer even its sign by any simple
process of inspection. We have, for instance, from (5), by the
process used in obtaining (4) for the special case of § 6,

$$
\left[(AB\gamma) - \frac{(A\gamma)(B\gamma)}{(\gamma)}\right]
 = \left[(AB) - \frac{(A)(B)}{N}\right]
 - \left[(ABC) - \frac{(AC)(BC)}{(C)}\right]
 - \frac{N}{(C)(\gamma)}\left[(AC) - \frac{(A)(C)}{N}\right]\left[(BC) - \frac{(B)(C)}{N}\right]
$$

which gives us the difference of (ABγ) from the value it would
have if A and B were independent in the universe of γ's in terms
of the difference of (ABC) from the value it would have if A and
B were independent in the universe of C's, and the corresponding
differences for the frequencies (AB), (AC), and (BC). The four
quantities in the brackets on the right represent, say, the four
known associations, the bracket on the left the unknown association.
Clearly, the relation is not of such a simple kind that the term on
the left can be, in general, mentally evaluated. Hence in
considering the choice and number of associations to be actually
tabulated, regard must be had to practical considerations rather
than to theoretical relations.

13. The particular case in which all the $2^n - n - 1$ given
associations are zero is worth some special investigation.

It follows, in the first place, that all other possible associations
must be zero, i.e. that a state of complete independence, as we
may term it, exists. Suppose, for instance, that we are given

$$
(AB) = \frac{(A)(B)}{N}, \quad (AC) = \frac{(A)(C)}{N}, \quad (BC) = \frac{(B)(C)}{N}, \quad
(ABC) = \frac{(AC)(BC)}{(C)} = \frac{(A)(B)(C)}{N^2}
$$

It follows at once that we have also

$$
(ABC) = \frac{(AB)(BC)}{(B)} = \frac{(AB)(AC)}{(A)}
$$

i.e. A and C are independent in the universe of B's, and B and C
in the universe of A's. Again,

$$
(AB\gamma) = (AB) - (ABC) = \frac{(A)(B)}{N} - \frac{(A)(B)(C)}{N^2}
 = \frac{(A)(B)(\gamma)}{N^2} = \frac{(A\gamma)(B\gamma)}{(\gamma)}
$$

Therefore A and B are independent in the universe of γ's.
Similarly, it may be shown that A and C are independent in the
universe of β's, B and C in the universe of α's.

In the next place it is evident from the above that relations of
the general form (to write the equation symmetrically)

$$
\frac{(ABC)}{N} = \frac{(A)}{N} \cdot \frac{(B)}{N} \cdot \frac{(C)}{N} \tag{8}
$$

must hold for every class-frequency. This relation is the general
form of the equation of independence, (2) (d), Chap. III. (p. 26).

14. It must be noted, however, that (8) is not a criterion for the
complete independence of A, B, and C in the sense that the
equation

$$
\frac{(AB)}{N} = \frac{(A)}{N} \cdot \frac{(B)}{N}
$$

is a criterion for the complete independence of A and B. If we
are given N, (A), and (B), and the last relation quoted holds
good, we know that similar relations must hold for (Aβ), (αB),
and (αβ). If N, (A), (B), and (C) be given, however, and the
equation (8) hold good, we can draw no conclusion without
further information; the data are insufficient. There are eight
algebraically independent class-frequencies in the case of three
attributes, while N, (A), (B), (C) are only four; the equation (8)
must therefore be shown to hold good for four frequencies of the
third order before the conclusion can be drawn that it holds good
for the remainder, i.e. that a state of complete independence
subsists. The direct verification of this result is left for the
student.

Quite generally, if N, (A), (B), (C), .... be given, the relation

$$
\frac{(ABC \ldots)}{N} = \frac{(A)}{N} \cdot \frac{(B)}{N} \cdot \frac{(C)}{N} \cdots \tag{9}
$$

must be shown to hold good for $2^n - n - 1$ of the nth order classes
before it may be assumed to hold good for the remainder. It is
only because

$$
2^n - n - 1 = 1
$$

when n = 2 that the relation

$$
\frac{(AB)}{N} = \frac{(A)}{N} \cdot \frac{(B)}{N}
$$

may be treated as a criterion for the independence of A and B.
If all the n (n > 2) attributes are completely independent, the
relation (9) holds good; but it does not follow that if the relation
(9) hold good they are all independent.
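
The warning of § 14 can be illustrated numerically. The sketch below uses hypothetical frequencies, invented here and not taken from the text, for which relation (8) holds for the class ABC although A and B are plainly associated. The helper names `freq` and `independence_value` are likewise illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical ultimate class-frequencies (not from the text), keyed by
# (A present, B present, C present); 1 = attribute present, 0 = absent.
cells = {
    (1, 1, 1): 2, (1, 1, 0): 4, (1, 0, 1): 1, (1, 0, 0): 1,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 4, (0, 0, 0): 2,
}
N = sum(cells.values())


def freq(a=None, b=None, c=None):
    """Class-frequency; None leaves that attribute unspecified."""
    return sum(v for (x, y, z), v in cells.items()
               if a in (None, x) and b in (None, y) and c in (None, z))


def independence_value(a, b, c):
    """Value the third-order class would have under relation (8)."""
    return Fraction(freq(a=a) * freq(b=b) * freq(c=c), N * N)


# For which third-order classes does relation (8) hold?
holds = {k: cells[k] == independence_value(*k)
         for k in product((1, 0), repeat=3)}

print(holds[(1, 1, 1)], all(holds.values()))
print(freq(a=1, b=1), Fraction(freq(a=1) * freq(b=1), N))
```

Here (ABC) equals its independence-value, yet (AB) = 6 against an independence-value of 4, so relation (8) holding for one class settles nothing about the remainder.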

REFERENCES.

(1) Yule, G. U., "On the Association of Attributes in Statistics," Phil.
Trans. Roy. Soc., Series A, vol. cxciv., 1900, p. 257. (Deals fully
with the theory of partial as well as of total association, with numerous
illustrations: a notation suggested for the partial coefficients.)

(2) Yule, G. U., "Notes on the Theory of Association of Attributes in
Statistics," Biometrika, vol. ii., 1903, p. 121. (Cf. especially §§ 4 and
5, on the theory of complete independence, and the fallacies due to
mixing of records.)






1. Take the following figures for girls corresponding to those for boys in
Example i., p. 46, and discuss them similarly, but not necessarily using
exactly the same comparisons, to see whether the conclusion that "the
connecting link between defects of body and mental dulness is the coincident
defect of brain which may be known by observation of abnormal nerve signs"
seems to hold good.

A, development defects. B, nerve signs. D, mental dulness.

    N     10,000      (AB)    248
    (A)      682      (AD)    307
    (B)      850      (BD)    363
    (D)      689      (ABD)   128

2. (Material from Census of England and Wales, 1891, vol. iii.) The
following figures give the numbers of those suffering from single or combined
infirmities: (1) for all males, (2) for males of 55 years of age and over.

A, Blindness. B, Mental derangement. C, Deaf-mutism.

(1) (2) (1) (2) 

All Males. Males SS- All Males. Males 66- 

If li,068,O00 1,877,000 {AB) 183 66 

(A) 12,281 5,538 (AC) 61 14 

{B) 46,892 10,800 {BG) 299 47 

(C) 7,707 746 (ABC) 11 3 

Tabulate proportions per thousand, exhibiting the total association between
blindness and mental derangement, and the partial association between the
same two infirmities among deaf-mutes, (1) for males in general, (2) for those
of 55 years of age or over. Give a short verbal statement of the results, and
contrast them with those of Question 1.

3. (Material from Supplement to 55th Annual Report Reg.-Genl.)

The death-rate from cancer for occupied males in general (over 16) is
0'S86 per thousand per annum, and for farmers 1.20.

The death-rates from cancer for occupied males under and over 46
are 0.13 and 2.26 respectively. Of the farmers 46.1 per cent. are over
46.

Would you say that farmers were peculiarly liable to cancer?

4. A population of males over 16 years of age consists of 7 per cent. over 65
years of age and 93 per cent. under. The death-rates are 12 per thousand per
annum in the younger class and 110 in the older, or 18.86 in the whole
population. The death-rate of males (over 16) engaged in a certain industry
is 26.7 per thousand.

If the industry be n< 
of those over 65 e 
distribntion) 1 

5. Show that if A and B are independent, while A and C, B and C are
associated, A and B must be disassociated either in the universe of C's,
the universe of γ's, or both.

6. As an illustration of Question 5, show that if the following were actual
data, there would be a slight disassociation between the eye-colours of
husband and wife (father and mother) for the parents either of light-eyed
sons or not-light-eyed sons, or both, although there is a slight positive
association for parents at large.

A light eye-colour in husband, B in wife, C in son:



N 


1000 


(-i») 


{-<) 


(122 


M(7) 


(■B) 


5&8 


(«0 


(CO 


617 





7. Show that if (ABC) = (αβγ), (αBC) = (Aβγ), and so on (the case of
"complete equality of contrary frequencies" of Question 7, Chap. I.), A, B,
and C are completely independent if A and B, A and C, B and C are
independent pair and pair.

8. If, in the same case of complete equality of contraries,

{AB)-N/i = S^ 

(BG) -Nji = t, 
show that 



! (ABU) 



-y^5fa]={(.«-<-«p-']= 



so that the partial associations between A and B in
positive or negative according as

•'< A- 

9. In the simple contests of a general election (contests in which one
Conservative opposed one Liberal and there were no other candidates) 66 per
cent. of the winning candidates (according to the returns) spent more money
than their opponents. Given that 68 per cent. of the winners were
Conservatives, and that the Conservative expenditure exceeded the Liberal in SO
per cent. of the contests, find the percentages of elections won by Conservatives
(1) when they spent more and (2) when they spent less than their opponents,
and hence say whether you consider the above figures evidence of the influence
of expenditure on election results or no. (Note that if the one candidate in a
contest be a Conservative-winner-who spends more than his opponent, the
other must necessarily be a Liberal-loser-who spends less, and so forth.
Hence the case is one of complete equality of contraries.)

10. Given that (A)/N = (B)/N = (C)/N = x, and that (AB)/N = (AC)/N = y,
find the major and minor limits to y that enable one to infer positive
association between B and C, i.e. (BC)/N > x².

Draw a diagram on squared paper to illustrate your answer, taking x and y
as co-ordinates, and shading the limits within which y must lie in order to
permit of the above inference. Point out the peculiarities in the case of
inferring a positive association from two negative associations.

11. Discuss similarly the more complex case (A)/N = x, (B)/N = 2x,
(C)/N = 3x:

(1) for inferring positive association between B and C given
(AB)/N = (AC)/N = y.

(2) for inferring positive association between A and C given
(AB)/N = (BC)/N = y.

(3) for inferring positive association between A and B given
(AC)/N = (BC)/N = y.






MANIFOLD CLASSIFICATION.

1. The general principle of a manifold classification. 2-4. The table of
double-entry or contingency table and its treatment by fundamental
methods. 5-8. The coefficient of contingency. 9-10. Analysis of
a contingency table by tetrads. 11-13. Isotropic and anisotropic
distributions. 14-16. Homogeneity of the classifications dealt with
in this and the preceding chapters: heterogeneous classifications.

1. Classification by dichotomy is, as was briefly pointed out in
Chap. I. § 5, a simpler form of classification than usually occurs
in the tabulation of practical statistics. It may be regarded as
a special case of a more general form in which the individuals or
objects observed are first divided under, say, s heads, $A_1, A_2, \ldots, A_s$,
each of the classes so obtained then subdivided under t heads,
$B_1, B_2, \ldots, B_t$, each of these under u heads, $C_1, C_2, \ldots, C_u$,
and so on, thus giving rise to s.t.u. .... ultimate classes altogether.

2. The general theory of such a manifold as distinct from a
twofold or dichotomous classification, in the case of n attributes
or characters ABC .... N, would be extremely complex: in the
present chapter the discussion will be confined to the case of two
characters, A and B, only. If the classification of the A's be
s-fold and of the B's t-fold, the frequencies of the st classes of the
second order may be most simply given by forming a table with
s columns headed $A_1$ to $A_s$, and t rows headed $B_1$ to $B_t$. The
number of the objects or individuals possessing any combination
of the two characters, say $A_m$ and $B_n$, i.e. the frequency of the
class $A_m B_n$, is entered in the compartment common to the mth
column and the nth row, the st compartments thus giving all
the second-order frequencies. The totals at the ends of rows
and the feet of columns give the first-order frequencies, i.e. the
numbers of $A_m$'s and $B_n$'s, and finally the grand total at the
right-hand bottom corner gives the whole number of observations.
Tables I. and II. below will serve as illustrations of such tables
of double-entry or contingency tables, as they have been termed
by Professor Pearson (ref. 1).

3. In Table I. the division is 3 x 3-fold: the houses in England
and Wales are divided into those which are in (1) London, (2)
other urban districts, (3) rural districts, and the houses in each
of these divisions are again classified into (1) inhabited houses,
(2) uninhabited but completed houses, (3) houses that are
"building," i.e. in course of erection. Thus from the first row
we see that there were in London, in round numbers, 616,000
houses, of which 571,000 were inhabited, 40,000 uninhabited,
and 5000 in course of erection: from the first column, there
were 6,260,000 inhabited houses in England and Wales, of which
571,000 were in London, 4,064,000 in other urban districts, and
1,625,000 in rural districts.







Table I. Houses in England and Wales (numbers in thousands).

                                Inhabited.  Uninhabited.  Building.  Total.
    Adm. County of London           571          40            5       616
    Other urban districts .        4064         285           45      4394
    Rural districts . . . .        1625         124           12      1761
    Total for England and Wales    6260         449           62      6771



In Table II., on the other hand, the classification is 3 x 4-fold:
the eye-colours are classed under the three heads "blue," "grey or
green," and "brown," while the hair-colours are classed under
four heads, "fair," "brown," "black," and "red." The table is



Table II. Eye-colour and hair-colour.

                               Hair-colour.
    Eye-colour.        Fair.   Brown.   Black.   Red.    Total.
    Blue . . . . .     1768      807      189     47      2811
    Grey or Green      946     1387      746     53      3132
    Brown . . . .       115      438      288     16       857
    Total . . . .      2829     2632     1223    116      6800

read similarly to the last. Taking the first row, it tells us that
there were 2811 men with blue eyes noted, of whom 1768 had
fair hair, 807 brown hair, 189 black hair, and 47 red hair.
Similarly, from the first column, there were 2829 men with fair
hair, of whom 1768 had blue eyes, 946 grey or green eyes, and
115 brown eyes. The tables are a generalised form of the
fourfold (2 x 2-fold) tables in § 13, Chap. III.

4. For the purpose of discussing the nature of the relation
between the A's and the B's, any such table may be treated on
the principles of the preceding chapters by reducing it in different
ways to 2 x 2-fold form. It then becomes possible to trace the
association between any one or more of the A's and any one or
more of the B's, either in the universe at large or in universes
limited by the omission of one or more of the A's, of the B's, or
of both. Taking Table I., for example, trace the association
between the erection of houses and the urban character of a
district. Adding together the first two rows, i.e. pooling London
and the other urban districts together, and similarly adding the
first two columns, so as to make no distinction between inhabited
and uninhabited houses as long as they are completed, we find:

    Proportion of all houses which are in course of
      erection in urban districts . . . . . . . .   50/5010 = 10 per thousand.
    Proportion of all houses which are in course of
      erection in rural districts . . . . . . . .   12/1761 =  7 per thousand.

There is therefore, as might be expected, a distinct positive
association, a larger proportion of houses being in course of
erection in urban than in rural districts.

If, as another illustration, it be desired to trace the association
between the "uninhabitedness" of houses and the urban character
of the district, the procedure will be rather different. Rows 1
and 2 may be added together as before, but column 3 may be
omitted altogether, as the houses which are only in course of
erection do not enter into the question. We then have:

    Proportion of all houses which are uninhabited
      in urban districts . . . . . . . . . . . .   325/4960 = 66 per thousand.
    Proportion of all houses which are uninhabited
      in rural districts . . . . . . . . . . . .   124/1749 = 71 per thousand.

The association is therefore negative, the proportion of houses
uninhabited being greater in rural than in urban districts.
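
The two reductions of Table I. to 2 x 2-fold form may be sketched as follows. The dictionary `table_1` and the helpers `pool` and `per_thousand` are names introduced here for illustration; the figures are those of Table I., in thousands.

```python
# Table I., in thousands. Each row: (inhabited, uninhabited, building).
table_1 = {
    "London": (571, 40, 5),
    "Other urban": (4064, 285, 45),
    "Rural": (1625, 124, 12),
}


def pool(rows):
    """Add the named rows together, column by column."""
    return tuple(sum(col) for col in zip(*(table_1[r] for r in rows)))


def per_thousand(part, whole):
    return round(1000 * part / whole)


urban = pool(["London", "Other urban"])
rural = pool(["Rural"])

# Houses building, as a proportion of all houses in the district.
building = {
    "urban": per_thousand(urban[2], sum(urban)),
    "rural": per_thousand(rural[2], sum(rural)),
}

# Houses uninhabited, as a proportion of completed houses only
# (column 3 omitted).
uninhabited = {
    "urban": per_thousand(urban[1], urban[0] + urban[1]),
    "rural": per_thousand(rural[1], rural[0] + rural[1]),
}

print(building, uninhabited)
```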




The eye- and hair-colour data of Table II. may be treated in a
precisely similar fashion. If, e.g., we desire to trace the
association between a lack of pigmentation in eyes and in hair, rows 1
and 2 may be pooled together as representing the least pigmentation
of the eyes, and columns 2, 3, and 4 may be pooled together
as representing hair with a more or less marked degree of
pigmentation. We then have:

    Proportion of light-eyed with fair hair .   2714/5943 = 46 per cent.
    Proportion of brown-eyed with fair hair .    115/857  = 13 per cent.

The association is therefore well-marked. For comparison we
may trace the corresponding association between the most marked
degree of pigmentation in eyes and hair, i.e. brown eyes and
black hair. Here we must add together rows 1 and 2 as before,
and columns 1, 2, and 4, the column for red being really misplaced,
as red represents a comparatively slight degree of pigmentation.
The figures are:

    Proportion of brown-eyed with black hair .   288/857  = 34 per cent.
    Proportion of light-eyed with black hair .   935/5943 = 16 per cent.

The association is again positive and well-marked, but the
difference between the two percentages is rather less than in the
last case.
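
The pooled proportions for Table II. can be reproduced with a sketch of the following kind. The layout of `table_2` (rows as eye-colours, columns as hair-colours) follows the table, while the function name `proportion` and the index lists are illustrative choices.

```python
# Table II. Rows: blue, grey or green, brown eyes.
# Columns: fair, brown, black, red hair.
table_2 = [
    [1768, 807, 189, 47],
    [946, 1387, 746, 53],
    [115, 438, 288, 16],
]


def proportion(rows, cols):
    """Share of the men in the pooled rows whose hair-colour falls
    in the pooled columns."""
    part = sum(table_2[r][c] for r in rows for c in cols)
    whole = sum(sum(table_2[r]) for r in rows)
    return part / whole


light_eyes, brown_eyes = [0, 1], [2]
fair_hair, black_hair = [0], [2]

percentages = {
    "light-eyed, fair hair": round(100 * proportion(light_eyes, fair_hair)),
    "brown-eyed, fair hair": round(100 * proportion(brown_eyes, fair_hair)),
    "brown-eyed, black hair": round(100 * proportion(brown_eyes, black_hair)),
    "light-eyed, black hair": round(100 * proportion(light_eyes, black_hair)),
}
print(percentages)
```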

5. The mode of treatment adopted in the preceding section rests
on first principles, and, if fully carried out, it gives the most
detailed information possible with regard to the relations of the
two attributes. At the same time a distinct need is felt in
practical work for some more summary method, a method which
will enable a single and definite answer to be given to such a
question as: Are the A's on the whole distinctly dependent on the
B's; and if so, is this dependence very close, or the reverse? The
coefficient of association, which affords the answer to this question
in the case of a dichotomous classification, was only dealt with
briefly and incidentally, for where there are only four classes of the
second order to be considered the matter is not nearly so complex as
where the number is, say, twenty-five or more, and the need for any
summary coefficient is not so often nor so keenly felt; moreover, the
coefficient most widely used (Chap. III. ref. 3) is hardly susceptible
of elementary treatment. The ideas on which Professor Pearson's
general measure of dependence, the "coefficient of contingency," is
based, are, however, quite simple and fundamental, and the mode of
calculation is therefore given in full in the following section. The
advanced student should refer to the original memoir (ref. 1) for
the complete treatment of the theory of the coefficient, and of its
relation to the theory of variables.

6. Generalising slightly the notation of the preceding chapters,
let the frequency of $A_m$'s be denoted by $(A_m)$, the frequency of
$B_n$'s by $(B_n)$, and the frequency of objects or individuals possessing
both characters by $(A_m B_n)$. Then, if the A's and B's be
completely independent in the universe at large, we must have for all
values of m and n

$$
(A_m B_n) = \frac{(A_m)(B_n)}{N} = (A_m B_n)_0 \tag{1}
$$

If, however, A and B are not completely independent, $(A_m B_n)$ and
$(A_m B_n)_0$ will not be identical for all values of m and n. Let
the difference be given by

$$
\delta_{mn} = (A_m B_n) - (A_m B_n)_0 \tag{2}
$$

A coefficient such as we are seeking may evidently be based in
some way on these values of δ. It will not do, however, simply to
add them together, for the sum of all the values of δ, some of
which are negative and others positive, must be zero in any case,
the sum of both the $(AB)$'s and the $(AB)_0$'s being equal to the
whole number of observations N. It is necessary, therefore, to
get rid of the signs, and this may be done in two simple ways: (1)
by neglecting them and forming the arithmetical instead of the
algebraical sum of the differences δ, or (2) by squaring the
differences and then summing the squares. The first process is the
shorter, but the second the better, as it leads to a coefficient
easily treated by algebraical methods, which the first process
does not: as the student will see later, squaring is very
usefully and very frequently employed for the purpose of
eliminating algebraical signs. Suppose, then, that every δ is calculated,
and also the ratio of its square to the corresponding value of
$(AB)_0$, and that the sum of all such ratios is, say, $\chi^2$; or, in
symbols, using Σ to denote "the sum of all quantities like":

$$
\chi^2 = \Sigma \left\{ \frac{\delta_{mn}^2}{(A_m B_n)_0} \right\} \tag{3}
$$

Being the sum of a series of squares, $\chi^2$ is necessarily positive,
and if A and B be independent it is zero, because every δ is zero.
If, then, we form a coefficient C given by the relation

$$
C = \sqrt{\frac{\chi^2}{N + \chi^2}} \tag{4}
$$

this coefficient is zero if the characters A and B are completely
independent, and approaches more and more nearly towards
unity as $\chi^2$ increases. In general, no sign should be attached
to the root, for the coefficient simply shows whether the two
characters are or are not independent, and nothing more, but in
some cases a conventional sign may be used. Thus in Table II.
slight pigmentation of eyes and of hair appear to go together,
and the contingency may be regarded as definitely positive. If
slight pigmentation of eyes had been associated with marked
pigmentation of hair, the contingency might have been regarded
as negative. C is Professor Pearson's mean square contingency
coefficient.¹

¹ Professor Pearson (ref. 1) terms δ a sub-contingency; $\chi^2$ the square
contingency; the ratio $\chi^2/N$, which he denotes by $\phi^2$, the mean square
contingency; and the sum of all the δ's of one sign only, on which a
different coefficient can be based, the mean contingency.

7. The coefficient, in the simple form (4), has one disadvantage,
viz. that coefficients calculated on different systems of
classification are not comparable with each other. It is clearly
desirable for practical purposes that two coefficients calculated from
the same data classified in two different ways should be, at least
approximately, identical. With the present coefficient this is not
the case: if certain data be classified in, say, (1) 6 x 6-fold, (2)
3 x 3-fold form, the coefficient in the latter form tends to be the
smaller. The greatest possible value of the coefficient is, in fact,
only unity if the number of classes be infinitely great; for any
finite number of classes the limiting value of C is the smaller the
smaller the number of classes. This may be briefly illustrated as
follows. Replacing $\delta_{mn}$ in equation (3) by its value in terms of
$(A_m B_n)$ and $(A_m B_n)_0$, we have

$$
\chi^2 = \Sigma \left\{ \frac{(A_m B_n)^2}{(A_m B_n)_0} \right\} - N \tag{5}
$$

and therefore, denoting the expression in brackets by S,

$$
C = \sqrt{\frac{S - N}{S}} \tag{6}
$$

Now suppose we have to deal with a t x t-fold classification in
which $(A_m) = (B_m)$ for all values of m; and suppose, further, that
the association between $A_m$ and $B_m$ is perfect, so that $(A_m B_m) =
(A_m) = (B_m)$ for all values of m, the remaining frequencies of the
second order being zero; all the frequency is then concentrated
in the diagonal compartments of the table, and each contributes
N to the sum S, since $(A_m B_m)^2 / (A_m B_m)_0 = (A_m)^2 N / (A_m)^2 = N$.
The total value of S is accordingly tN, and the value of C is

$$
C = \sqrt{\frac{t - 1}{t}}
$$

This is the greatest possible value of C for a symmetrical t x t-fold
classification, and therefore, in such a table, for

    t =  2    C cannot exceed 0.707
    t =  3    C cannot exceed 0.816
    t =  4    C cannot exceed 0.866
    t =  5    C cannot exceed 0.894
    t =  6    C cannot exceed 0.913
    t =  7    C cannot exceed 0.926
    t =  8    C cannot exceed 0.935
    t =  9    C cannot exceed 0.943
    t = 10    C cannot exceed 0.949
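
The limiting values tabulated above follow from C = √((t − 1)/t). The sketch below evaluates that formula and, as a check under assumed data, recomputes C from equation (6) for a purely diagonal table; the diagonal frequencies 10, 20, 30 are invented for illustration, as are the function names.

```python
from math import sqrt


def max_contingency(t):
    """Greatest value of C for a symmetrical t x t-fold classification."""
    return sqrt((t - 1) / t)


def c_from_diagonal(diagonal):
    """C from equation (6) for a table whose whole frequency lies on the
    diagonal, so that (A_m) = (B_m) = (A_m B_m)."""
    n = sum(diagonal)
    # Each diagonal cell contributes (A_m B_m)^2 / ((A_m)(B_m)/N) = N to S.
    s = sum(f * f / (f * f / n) for f in diagonal)
    return sqrt((s - n) / s)


limits = {t: round(max_contingency(t), 3) for t in range(2, 11)}
print(limits)
print(c_from_diagonal([10, 20, 30]))
```

The diagonal check returns the same value whatever the individual diagonal frequencies, which is the point of the argument in the text.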



It is as well, therefore, to restrict the use of the "coefficient of
contingency" to 5 x 5-fold or finer classifications. At the same
time the classification must not be made too fine, or else the value
of the coefficient is largely affected by casual irregularities of no
physical significance in the class-frequencies (cf. the remarks in
Chap. III. §§ 7-8).



Table III. Independence-Values of the Frequencies for Table II.

    Eye-colour.        Fair.   Brown.   Black.   Red.
    Blue . . . . .     1169     1088      506    48.0
    Grey or Green      1303     1212      563    53.4
    Brown . . . .       357      332      154    14.6


8. As the classification of Table II. is only 3 x 4-fold, it is rather
crude for the purpose of calculating the coefficient, but will serve
simply as an illustration of the form of the arithmetic. In Table
III. are given the values of the independence frequencies, 2829 x
2811/6800 = 1169 and so on. The value of $\chi^2$ is more readily
calculated from equation (5) than from (3):



    (1768)²/1169   =   2673.9
    (946)²/1303    =    686.8
    (115)²/357     =     37.0
    (807)²/1088    =    598.6
    (1387)²/1212   =   1587.3
    (438)²/332     =    577.8
    (189)²/506     =     70.6
    (746)²/563     =    988.5
    (288)²/154     =    538.6
    (47)²/48.0     =     46.0
    (53)²/53.4     =     52.6
    (16)²/14.6     =     17.5

    Total = S      =   7875.2
    N              =   6800
    S - N = χ²     =   1075.2

$$
C = \sqrt{\frac{1075.2}{7875.2}} = \sqrt{0.1365} = 0.37
$$



The squares in such work may conveniently be taken from
Barlow's Tables of Squares, Cubes, etc. (see list of tables on
p. 353), or logarithms may be used throughout; five-figure
logarithms are quite sufficient.
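
The whole of the arithmetic of § 8 can be sketched in a few lines. The variable names are illustrative only. Note that the sketch works with unrounded independence-values, so its S and $\chi^2$ differ by a unit or two from the figures obtained above with the rounded values of Table III.; the coefficient agrees to two places.

```python
from math import sqrt

# Table II. Rows: blue, grey or green, brown eyes.
# Columns: fair, brown, black, red hair.
table_2 = [
    [1768, 807, 189, 47],
    [946, 1387, 746, 53],
    [115, 438, 288, 16],
]

N = sum(map(sum, table_2))
row_totals = [sum(row) for row in table_2]
col_totals = [sum(col) for col in zip(*table_2)]

# Independence-values (A_m B_n)_0 = (A_m)(B_n)/N, as in Table III.
independence = [[r * c / N for c in col_totals] for r in row_totals]

# S is the sum of (A_m B_n)^2 / (A_m B_n)_0; by equation (5), chi^2 = S - N.
S = sum(obs ** 2 / exp
        for obs_row, exp_row in zip(table_2, independence)
        for obs, exp in zip(obs_row, exp_row))
chi_squared = S - N

# Equation (3) gives the same quantity from the differences delta.
chi_squared_from_deltas = sum((obs - exp) ** 2 / exp
                              for obs_row, exp_row in zip(table_2, independence)
                              for obs, exp in zip(obs_row, exp_row))

C = sqrt(chi_squared / (N + chi_squared))
print(round(S, 1), round(chi_squared, 1), round(C, 2))
```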

9. While such a coefficient of contingency, in some form or 
other, is a great convenience in many fields of work, its use 
should not lead to a neglect of those details which a treatment by 
the elementary methods of § 4 would have revealed. Whether 
the coefficient be calculated or no, every table should always be
examined with care to see if it exhibit any apparently significant
peculiarities in the distribution of frequency, e.g. in the
associations subsisting between $A_m$ and $B_n$ in limited universes. A good
deal of caution must be used in order not to be misled by casual 
irregularities due to paucity of observations in some compartments 
of the table, but important points that would otherwise be over- 
looked will often be revealed by such a detailed examination. 

10. Suppose, for example, that any four adjacent frequencies,
say

$$\begin{matrix} (A_m B_n) & (A_{m+1} B_n) \\ (A_m B_{n+1}) & (A_{m+1} B_{n+1}) \end{matrix}$$

are extracted from the general contingency table. Considering
these as a table exhibiting the association between $A_m$ and $B_n$ in
a universe limited to $A_m A_{m+1}$, $B_n B_{n+1}$ alone, the association is
positive, negative, or zero according as $(A_m B_n)/(A_{m+1} B_n)$ is greater
than, less than, or equal to the ratio $(A_m B_{n+1})/(A_{m+1} B_{n+1})$. The
whole of the contingency table can be analysed into a series of
elementary groups of four frequencies like the above, each one
overlapping its neighbours so that an r x s-fold table contains
(r - 1)(s - 1) such "tetrads," and the associations in them all can
be very quickly determined by simply tabulating the ratios like
$(A_m B_n)/(A_{m+1} B_n)$, $(A_m B_{n+1})/(A_{m+1} B_{n+1})$, etc., or perhaps better,
the proportions $(A_m B_n)/\{(A_m B_n) + (A_{m+1} B_n)\}$, etc., for every pair
of columns or of rows, as may be most convenient. Taking the
figures of Table II. as an illustration, and working from the
rows, the proportions run as follows:

    For rows 1 and 2.            For rows 2 and 3.

    1768/2714    0.651            946/1061    0.892
     807/2194    0.368           1387/1825    0.760
     189/935     0.202            746/1034    0.721
      47/100     0.470             53/69      0.768

In both cases the first three ratios form descending series, but
the fourth ratio is greater than the second. The signs of the
associations in the six tetrads are accordingly

    +    +    -
    +    +    -

The negative sign in the two tetrads on the right is striking,
the more so as other tables for hair- and eye-colour, arranged in
the same way, exhibit just the same characteristic. But the
peculiarity will be removed at once if the fourth column be placed
immediately after the first; if this be done, i.e. if "red" be placed
between "fair" and "brown" instead of at the end of the
colour-series, the sign of the association in all the elementary tetrads
will be the same. The colours will then run fair, red, brown,
black, and this would seem to be the more natural order,
considering the depth of the pigmentation.
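The examination of the tetrads can be sketched in a few lines. The
example below takes the frequencies of Table II. as they appear in the
proportions above (rows: blue, grey or green, brown eyes; columns:
fair, brown, black, red hair) and determines the sign of each
elementary tetrad from the cross-product difference, first in the
printed order and then with "red" placed after "fair". The function
name is illustrative only.

```python
def tetrad_signs(table):
    """Sign of association in each elementary tetrad of adjacent cells."""
    signs = []
    for i in range(len(table) - 1):
        row_signs = []
        for j in range(len(table[0]) - 1):
            # Cross-product difference of the four adjacent frequencies.
            d = (table[i][j] * table[i + 1][j + 1]
                 - table[i][j + 1] * table[i + 1][j])
            row_signs.append("+" if d > 0 else "-" if d < 0 else "0")
        signs.append(row_signs)
    return signs


table_ii = [
    [1768, 807, 189, 47],
    [946, 1387, 746, 53],
    [115, 438, 288, 16],
]

# Place "red" (the fourth column) between "fair" and "brown".
reordered = [[row[0], row[3], row[1], row[2]] for row in table_ii]

print(tetrad_signs(table_ii))
print(tetrad_signs(reordered))
```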

11. A distribution of frequency of such a kind that the 
association in every elementary tetrad is of the same sign 
possesses several useful and interesting properties, as shown in 
the following theorems. It will be termed an isotropic dis- 
tribution. 

(1) In an isotropic distribution the sign of the association is
the same not only for every elementary tetrad of adjacent
frequencies, but for every set of four frequencies in the compartments
common to two rows and two columns, e.g. $(A_m B_n)$, $(A_{m+p} B_n)$,
$(A_m B_{n+q})$, $(A_{m+p} B_{n+q})$.

For suppose that the sign of association in the elementary
tetrads is positive, so that

$$(A_m B_n)(A_{m+1} B_{n+1}) > (A_{m+1} B_n)(A_m B_{n+1}) \tag{1}$$

and similarly,

$$(A_{m+1} B_n)(A_{m+2} B_{n+1}) > (A_{m+2} B_n)(A_{m+1} B_{n+1}) \tag{2}$$

Then multiplying up and cancelling we have

$$(A_m B_n)(A_{m+2} B_{n+1}) > (A_{m+2} B_n)(A_m B_{n+1}) \tag{3}$$

That is to say, the association is still positive though the two
columns $A_m$ and $A_{m+2}$ are no longer adjacent.

(2) An isotropic distribution remains isotropic in whatever way
it may be condensed by grouping together adjacent rows or columns.

Thus from (1) and (3) we have, adding,

$$(A_m B_n)[(A_{m+1} B_{n+1}) + (A_{m+2} B_{n+1})] > (A_m B_{n+1})[(A_{m+1} B_n) + (A_{m+2} B_n)],$$

that is to say, the sign of the elementary association is unaffected
by throwing the (m + 1)th and (m + 2)th columns into one.

(3) As the extreme case of the preceding theorem, we may 
suppose both rows and columns grouped and regrouped until 
only a 2 X 2-fold table is left ; we then have the theorem — 

If an isotropic distribution be reduced to a fourfold distribution
in any way whatever, by addition of adjacent rows and columns,
the sign of the association in such fourfold table is the same as in
the elementary tetrads of the original table.

The case of complete independence is a special case of isotropy. 
For if

$$(A_m B_n) = (A_m)(B_n)/N$$

for all values of m and n, the association is evidently zero for
every tetrad. Therefore the distribution remains independent
in whatever way the table be grouped, or in whatever way the
universe be limited by the omission of rows or columns. The
expression "complete independence" is therefore justified.

From the work of the preceding section we may say that Table 
II. is not isotropic as it stands, but may be regarded as a dis- 
arrangement of an isotropic distribution. It is best to rearrange 
such a table in isotropic order, as otherwise different reductions 
to fourfold form may lead to associations of different sign, though 
of course they need not necessarily do so. 
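Theorems (1) and (3) can be verified numerically on the rearranged
table. The sketch below assumes the frequencies of Table II. with the
hair-colour columns placed in the order fair, red, brown, black; it
collects the sign of association for every choice of two rows and two
columns, and for every reduction to a fourfold table by grouping
adjacent rows and columns. All names are illustrative only.

```python
from itertools import combinations

# Table II with the hair-colour columns in the order fair, red, brown, black.
isotropic = [
    [1768, 47, 807, 189],   # blue eyes
    [946, 53, 1387, 746],   # grey or green eyes
    [115, 16, 438, 288],    # brown eyes
]


def sign(a, b, c, d):
    """Sign (+1, 0, -1) of the association in the fourfold table a b / c d."""
    diff = a * d - b * c
    return (diff > 0) - (diff < 0)


def signs_for_all_pairs(table):
    """Theorem (1): any two rows taken with any two columns."""
    n_rows, n_cols = len(table), len(table[0])
    return {
        sign(table[i][j], table[i][l], table[k][j], table[k][l])
        for i, k in combinations(range(n_rows), 2)
        for j, l in combinations(range(n_cols), 2)
    }


def signs_for_all_fourfold_reductions(table):
    """Theorem (3): group adjacent rows and columns down to a 2 x 2 table."""
    n_rows, n_cols = len(table), len(table[0])

    def block(rows, cols):
        return sum(table[i][j] for i in rows for j in cols)

    found = set()
    for r in range(1, n_rows):          # rows 0..r-1 against rows r..end
        for c in range(1, n_cols):      # columns 0..c-1 against columns c..end
            upper, lower = range(r), range(r, n_rows)
            left, right = range(c), range(c, n_cols)
            found.add(sign(block(upper, left), block(upper, right),
                           block(lower, left), block(lower, right)))
    return found


print(signs_for_all_pairs(isotropic))
print(signs_for_all_fourfold_reductions(isotropic))
```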

12. The following will serve as an illustration of a table that
is not isotropic, and cannot be rendered isotropic by any
rearrangement of the order of rows and columns.



2. Blue-green, grey. 3. Dark grey, hazel. 4. Brown.

                1.      2.      3.      4.     Total.

    1          194      70      41      30      335
    2           83     124      41      36      284
    3           25      34      55      23      137
    4           56      36      43     109      244

    Total      358     264     180     198     1000



The following are the ratios of the frequency in column m to
the sum of the frequencies in columns m and m + 1:

                          Columns
    Row      1 and 2.     2 and 3.     3 and 4.

    1         0.735        0.631        0.577
    2         0.401        0.752        0.532
    3         0.424        0.382        0.705
    4         0.609        0.456        0.283



The order in which the ratios run is different for each pair of
columns, and it is accordingly impossible to make the table
isotropic. The distribution of signs of association in the several
tetrads is

    +    -    +
    -    +    -
    -    -    +


The distribution is a curious one, the associations in tetrads 
round the diagonal of the whole table being so markedly positive 
and those in the immediately adjacent tetrads equally markedly 
negative. Neglecting the other signs, this is the effect that 
would be produced by taking an isotropic distribution and then 
increasing the frequencies in the diagonal compartments by a 
sufficient percentage. Comparison of the given table with others 
from the same source shows that the peculiarity is common to
the great majority of the tables, and accordingly its origin
demands explanation. Were such a table treated by the method
of the contingency coefficient, or a similar summary method,
alone, the peculiarity might not be remarked.
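The signs of the nine tetrads, and the statement that no rearrangement
renders the table isotropic, can both be checked by brute force. The
sketch below assumes the sixteen frequencies of the 4 x 4 table of this
section, read so that rows and columns add to the printed totals; it
tries every ordering of rows and of columns. The function names are
illustrative only.

```python
from itertools import permutations

table = [
    [194, 70, 41, 30],
    [83, 124, 41, 36],
    [25, 34, 55, 23],
    [56, 36, 43, 109],
]


def tetrad_signs(t):
    """Sign of association in each elementary tetrad of adjacent cells."""
    signs = []
    for i in range(len(t) - 1):
        row = []
        for j in range(len(t[0]) - 1):
            d = t[i][j] * t[i + 1][j + 1] - t[i][j + 1] * t[i + 1][j]
            row.append("+" if d > 0 else "-" if d < 0 else "0")
        signs.append(row)
    return signs


def can_be_made_isotropic(t):
    """Try every order of rows and columns (576 arrangements for 4 x 4)."""
    for row_order in permutations(range(len(t))):
        for col_order in permutations(range(len(t[0]))):
            rearranged = [[t[i][j] for j in col_order] for i in row_order]
            found = {s for row in tetrad_signs(rearranged) for s in row}
            # Isotropic if no two tetrads have opposite signs.
            if not ({"+", "-"} <= found):
                return True
    return False


for row in tetrad_signs(table):
    print("  ".join(row))
print(can_be_made_isotropic(table))
```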

13. It may be noted, in concluding this part of the subject,
that in the case of complete independence the distribution of
frequency in every row is similar to the distribution in the row
of totals, and the distribution in every column similar to that in
the column of totals; for in, say, the column $A_n$ the frequencies
are given by the relations

$$(A_n B_1) = \frac{(A_n)}{N}(B_1), \quad (A_n B_2) = \frac{(A_n)}{N}(B_2), \quad (A_n B_3) = \frac{(A_n)}{N}(B_3)$$

and so on. This property is of special importance in the theory 
of variables. 
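The property may be illustrated with a small numerical sketch. The
marginal totals below are hypothetical figures chosen only for
illustration; the table is filled by the rule of complete independence,
and each column is then compared with the column of totals. Exact
fractions are used so that the comparison is not disturbed by rounding.

```python
from fractions import Fraction

# Hypothetical marginal totals, for illustration only.
a_totals = [30, 50, 20]   # totals of the columns A_1, A_2, A_3
b_totals = [40, 60]       # totals of the rows B_1, B_2
n = sum(a_totals)         # N, the whole number of observations


def independent_table(a_totals, b_totals):
    """Cell (A_m B_n) = (A_m)(B_n)/N; one row per B class."""
    total = sum(a_totals)
    return [[Fraction(a * b, total) for a in a_totals] for b in b_totals]


table = independent_table(a_totals, b_totals)

# Distribution within each column, as proportions of the column total.
column_shares = [
    [row[m] / a_totals[m] for row in table] for m in range(len(a_totals))
]
# Distribution in the column of totals.
total_shares = [Fraction(b, n) for b in b_totals]

print(column_shares)
print(total_shares)
```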

14. The classifications both of this and of the preceding chapters
have one important characteristic in common, viz. that they
are, so to speak, "homogeneous": the principle of division
being the same for all the sub-classes of any one class. Thus
A's and α's are both subdivided into B's and β's, $A_1$'s, $A_2$'s ....
$A_r$'s into $B_1$'s, $B_2$'s .... $B_s$'s, and so on. Clearly this is necessary
in order to render possible those comparisons on which the
discussions of associations and contingencies depend. If we
only know that amongst the A's there is a certain percentage
of B's, and amongst the α's a certain percentage of C's, there
are no data for any conclusion.

Many classifications are, however, essentially of a heterogeneous 
character, e.g. biological classifications into orders, genera, and 
species ; the classifications of the causes of death in vital 
statistics, and of occupations in the census. To take the last 
case as an illustration, the first "order" in the list of occupations 
is " General or Local Government of the Country," subdivided 
under the headings (1) National Government, (2) Local
Government. The next order is "Defence of the Country," with the
sub-headings (1) Army, (2) Navy and Marines, not (1) National
and (2) Local Government again; the sub-heads are necessarily
distinct. Similarly, the third order is "Professional Occupations
and their Subordinate Services," with the fresh sub-heads (1)
Clerical, (2) Legal, (3) Medical, (4) Teaching, (5) Literary and
Scientific, (6) Engineers and Surveyors, (7) Art, Music, Drama,
(8) Exhibitions, Games, etc. The number of sub-heads under
each main heading is, in such a case, arbitrary and variable,
and different for each main heading; but so long as the
classification remains purely heterogeneous, however complex
it may become, there is no opportunity for any discussion
of causation within the limits of the matter so derived. It is
only when a homogeneous division is in some way introduced
that we can begin to speak of associations and contingencies.

15. This may be done in various ways according to the
nature of the case. Thus the relative frequencies of different
botanical families, genera, or species may be discussed in
connection with the topographical characters of their habitats
(desert, marsh, or moor), and we may observe statistical
associations between given genera and situations of a given topographical
type. The causes of death may be classified according to sex, 
or age, or occupation, and it then becomes possible to discuss 
the association of a given cause of death with one or other 
of the two sexes, with a given age-group, or with a given 
occupation. Again, the classifications of deaths and of occupations 
are repeated at successive intervals of time ; and if they have 
remained strictly the same, it is also possible to discuss the 
association of a given occupation or a given cause of death with 
the earlier or later year of observation — i.e. to see whether the 
numbers of those engaged in the given occupation or succumbing 
to the given cause of death have increased or decreased. But 
in such circumstances the greatest care must be taken to see 
that the necessary condition as to the identity of the classifications
at the two periods is fulfilled, and unfortunately it very 
seldom is fulfilled. All practical schemes of classification are 
subject to alteration and improvement from time to time, and 
these alterations, however desirable in themselves, render a 
certain number of comparisons impossible. Even where a 
classification has remained verbally the same, it is not necessarily 
really the same ; thus, in the case of the causes of death, 
improved methods of diagnosis may transfer many deaths from 
one heading to another without any change in the incidence 
of the disease, and so bring about a virtual change in the 
classification. In any case, heterogeneous classification should
be regarded only as a partial process, incomplete until a
homogeneous division is introduced either directly or indirectly,
e.g. by repetition.




(1) (Data from Karl Pearson, "On the Inheritance of the Mental and Moral
Characters in Man," Jour. of the Anthrop. Inst., vol. xxxiii., and Biometrika,
vol. iii.) Find the coefficient of contingency (coefficient of mean square
contingency) for the two tables below, showing the resemblance between
brothers for athletic capacity and between sisters for temper. Show that
neither table is even remotely isotropic. (As stated in § 7, the coefficient of
contingency should not as a rule be used for tables smaller than 5 x 5-fold:
these small tables are given to illustrate the method, while avoiding lengthy
arithmetic.)



A. Athletic Capacity.

                                 First Brother.
                    Athletic.   Betwixt.   Non-athletic.   Total.

    Athletic           906         20          140          1066
    Betwixt             20         76            9           105
    Non-athletic       140          9          370           519

    Total             1066        105          519          1690

B. Temper.

                                 First Sister.
                    Quick.   Good-natured.   Sullen.   Total.

    Quick             198         177           77       452
    Good-natured      177         996          165      1338
    Sullen             77         165          120       362

    Total             452        1338          362      2152



PART II.— THE THEORY OF VARIABLES. 

CHAPTER VI.

THE FREQUENCY-DISTRIBUTION.

1. Introductory; 2. Necessity for classification of observations: the
frequency-distribution; 3. Illustrations; 4. Method of forming the table;
5. Magnitude of class-interval; 6. Position of intervals; 7. Process of
classification; 8. Treatment of intermediate observations; 9. Tabulation;
10. Tables with unequal intervals; 11. Graphical representation of the
frequency-distribution; 12. Ideal frequency-distributions; 13. The
symmetrical distribution; 14. The moderately asymmetrical distribution;
15. The extremely asymmetrical or J-shaped distribution; 16. The
U-shaped distribution.

1. The methods described in Chaps. I.-V. are applicable to all
observations, whether qualitative or quantitative; we have now
to proceed to the consideration of specialised processes, adapted
to the treatment of quantitative measurements, but not generally
available, except by the aid of more or less artificial hypotheses,
for the discussion of purely qualitative observations. Since
numerical measurement is applied only in the case of a quantity
that can present more than one numerical value, that is, a varying
quantity, or more shortly a variable, this section of the work may
be termed the theory of variables. As common examples of such
variables that are subject to statistical treatment may be cited
birth- or death-rates, prices, wages, barometer readings, rainfall
records, and measurements or enumerations (e.g. of glands, spines,
or petals) on animals or plants.

2. If some hundreds or thousands of values of a variable have
been noted merely in the arbitrary order in which they happened
to occur, the mind cannot properly grasp the significance of the
record: the observations must be ranked or classified in some
way before the characteristics of the series can be comprehended,
and those comparisons, on which arguments as to causation
depend, can be made with other series. The dichotomous
classification, considered in Chaps. I.-IV., is too crude: if the values are
merely classified as A's or α's according as they exceed or fall
short of some fixed value, a large part of the information given
by the original record is lost. A manifold classification, however
(cf. Chap. V.), avoids the crudity of the dichotomous form, since
the classes may be made as numerous as we please, and numerical
measurements lend themselves with peculiar readiness to a
manifold classification, for the class limits can be conveniently
and precisely defined by assigned values of the variable. For
convenience, the values of the variable chosen to define the
successive classes should be equidistant, so that the numbers of
observations in the different classes (the class-frequencies) may be
comparable. Thus for measurements of stature the interval
chosen for classifying (the class-interval, as it may be termed)
might be 1 inch, or 2 centimetres, the numbers of individuals
being counted whose statures fall within each successive inch, or
each successive 2 centimetres, of the scale; returns of birth- or
death-rates might be grouped to the nearest unit per thousand
of the population; returns of wages might be classified to the
nearest shilling, or, if desired to obtain a more condensed table,
by intervals of five shillings or ten shillings, and so on. When
the variation is discontinuous, as for example in enumerations
of numbers of children in families or of petals on flowers, the
unit is naturally taken as the class-interval unless the range of
variation is very great. The manner in which the observations
are distributed over the successive equal intervals of the scale is
spoken of as the frequency-distribution of the variable.

3. A few illustrations will make clearer the nature of such
frequency-distributions, and the service which they render in
summarising a long and complex record:

(a) Table I. In this illustration the mean annual death-rates,
expressed as proportions per thousand of the population per
annum, of the 632 registration districts of England and Wales,
for the decade 1881-90, have been classified to the nearest unit;
i.e. the numbers of districts have been counted in which the
death-rate was over 12.5 but under 13.5; over 13.5 but under
14.5, and so on. The frequency-distribution is shown by the
following table.






Table I. Showing the Numbers of Registration Districts in England and
Wales with Different mean Death-rates per Thousand of the Population
per Annum for the Ten Years 1881-90. (Material from the Supplement
to the 55th Annual Report of the Registrar-General for England and
Wales [C.-7769], 1895.)



Mean Animal 
Death-rate. 


Number of 
Districta with 

Death-rate_ 

between Limiki 

stated. 


Mean Annual 
Death-rate. 


Number of 
Diatriots with 

Death-rate 

between Limita 

atat«d. 


12-6-13-5 
lS-6-14-6 
U-B-16-6 
16-5-]66 
16 -6-17 -6 
17-&-18 6 
IS 6-18-6 
19 -5-20 -6 
20 ■5-21-5 
31 ■5-22-6 
22-6-23-6 


6 
16 
6! 
112 
159 
104 
67 
42 
25 
18 
S 


28-5-21 -6 
2* -6-26 '6 
25 6-28-5 
26-5-27-6 
27-5-28-6 
28-6-29-6 
29-6-80-5 
30 -5-31 -5 
31-6-32-6 
32-5-33-5 


6 

1 
2 

2 

i 


Total 


632 



Whilst a glance through the original returns fails to convey
any very definite impression, owing to the large and erratic
differences between the death-rates in successive districts, a brief
inspection of the above table brings out a number of important
points. Thus we see that the death-rates range, in round
numbers, from 13 to 33 per thousand per annum; that the
great majority of districts lie nearer the lower limit than the
upper; that the death-rates in some 60 per cent. of the districts
lie within the narrow limits 15.5 to 18.5, the rates being most
frequent near 17 per thousand, and so forth.

(b) Table II. The ages at death, in years, of the married
women in certain Quaker families were recorded and classified in
5-year groups according as they were over 17.5 but under 22.5,
over 22.5 but under 27.5, and so on. The frequency-distribution
was as follows:




Table II. Showing the Numbers of Married Women, in certain Quaker
Families, Dying at Different Ages. (Cited from Proc. Roy. Soc., vol. lxvii.
(1900), p. 172, On the Correlation between Duration of Life and Number
of Offspring, by Miss M. Beeton, Karl Pearson, and G. U. Yule.)



Age at Death, 
Years. 


Number of 

Women Dying 

between 

said Years 

of Age. 


Age at Death, 
Years. 


Number of 
Woirien Dying 

saidY^re 
of Age. 


17 ■6-22-6 
82 '6-27-6 
27-5-32-6 
32-5-37-6 
37-5-42-5 
*2-5-47-5 
47-6-52'6 
62-6-67-6 
67-6-82-5 


29 

87 

ee 
iw 

99 
87 
84 
64 
69 


62-6- 67-6 
67-5- 72-6 
72-5- 77-5 
77-6- 82-6 
82-5- 87-6 
87-6- 92-5 
92-6- 97-5 
87 -6-102 -5 


78 

77 
78 
69 
26 
7 
4 


Total 


1096 



The distribution is somewhat more irregular than in the last
case; the commencement is abrupt; a maximum frequency is
attained in the fourth class (age at death 32.5 to 37.5), and then
there is a slow fall to the age-class 52.5-57.5. After this class
the frequency rises again and attains a secondary maximum in
the age-class 67.5-72.5.

(c) Table III. The numbers of stigmatic rays on a number
of Shirley poppies were counted. As the range of variation is
not great, the unit is taken as the class-interval. The
frequency-distribution is given by the following table.

Table III. Showing the Frequencies of Seed Capsules on certain Shirley
Poppies, with Different Numbers of Stigmatic Rays. (Cited from
Biometrika, ii. p. 89, 1902.)





Number of 




Number of 




Capaules 
widTiaid 




Capsules 
wit^said 


Stkmatic 


Stigniatic 
Kays. 


Number of 


Number of 




Stigmatic Rajs 




StigmatioRays. 


g 


3 


14 


803 


7 




16 


234 


8 


38 


18 


128 


9 


106 


17 


50 


10 


152 


18 


19 




233 


IS 




12 


306 


20 


1 


18 


315 






Total 


1906 




The numbers of rays range from 6 to 20; 12, 13, or 14 rays
being the most usual.

4. To expand slightly the brief description given in § 2, tables
like the preceding are formed in the following way: (1) The
magnitude of the class-interval, i.e. the number of units to each
interval, is first fixed; one unit was chosen in the case of Tables
I. and III., five units in the case of Table II. (2) The position or
origin of the intervals must then be determined, e.g. in Table I.
we must decide whether to take as intervals 12-13, 13-14, 14-15,
etc., or 12.5-13.5, 13.5-14.5, 14.5-15.5, etc. (3) This choice
having been made, the complete scale of intervals is fixed, and the
observations are classified accordingly. (4) The process of
classification being finished, a table is drawn up on the general
lines of Tables I.-III., showing the total numbers of observations
in each class-interval. Some remarks may be made on each of
these heads.
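The four steps may be sketched as a short routine. This is only an
illustrative sketch: the function name and the sample death-rates are
hypothetical, and a value falling exactly on a class-limit is here
simply placed in the upper class, which is one possible convention
among several.

```python
import math
from collections import Counter


def frequency_distribution(values, interval, origin):
    """Count observations in equal class-intervals starting from origin.

    Returns (lower limit, upper limit, frequency) for every class from
    the lowest to the highest occupied one, empty classes included.
    """
    # Steps (1)-(3): the interval and origin fix the scale; classify.
    counts = Counter(math.floor((v - origin) / interval) for v in values)
    # Step (4): draw up the table.
    first, last = min(counts), max(counts)
    return [
        (origin + k * interval, origin + (k + 1) * interval, counts[k])
        for k in range(first, last + 1)
    ]


# Hypothetical death-rates, for illustration only.
rates = [12.7, 13.1, 13.4, 14.9, 16.2]
for lower, upper, frequency in frequency_distribution(rates, 1, 12.5):
    print(f"{lower}-{upper}: {frequency}")
```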

5. Magnitude of Class-Interval. As already remarked, in cases
where the variation proceeds by discrete steps of considerable
magnitude as compared with the range of variation, there is very
little choice as regards the magnitude of the class-interval. The
unit will in general have to serve. But if the variation be
continuous, or at least take place by discrete steps which are
small in comparison with the whole range of variation, there is
no such natural class-interval, and its choice is a matter for
judgment.

The two conditions which guide the choice are these: (a) we
desire to be able to treat all the values assigned to any one class,
without serious error, as if they were equal to the mid-value
of the class-interval, e.g. as if the death-rate of every district in
the first class of Table I. were exactly 13.0, the death-rate of
every district in the second class 14.0, and so on; (b) for
convenience and brevity we desire to make the interval as large as
possible, subject to the first condition. These conditions will
generally be fulfilled if the interval be so chosen that the whole
number of classes lies between 15 and 25. A number of classes
less than, say, ten leads in general to very appreciable inaccuracy,
and a number over, say, thirty makes a somewhat unwieldy
table. A preliminary inspection of the record should accordingly
be made and the highest and lowest values be picked out.
Dividing the difference between these by, say, five and twenty, we
have an approximate value for the interval. The actual value
should be the nearest integer or simple fraction.
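The rule of thumb just stated is easily put into figures. In the sketch
below the list of "simple" values from which the interval is chosen is
an assumption made for illustration, as are the function names; the
extreme values 12.5 and 33.5 are the outer limits of Table I.

```python
import math


def approximate_interval(lowest, highest, divisor=25):
    """Range of the record divided by, say, five and twenty."""
    return (highest - lowest) / divisor


def nearest_simple_value(raw, candidates=(0.1, 0.2, 0.25, 0.5, 1, 2, 5, 10)):
    """Nearest integer or simple fraction (candidate list is an assumption)."""
    return min(candidates, key=lambda c: abs(c - raw))


def number_of_classes(lowest, highest, interval):
    """Number of classes needed to cover the whole range."""
    return math.ceil((highest - lowest) / interval)


raw = approximate_interval(12.5, 33.5)
interval = nearest_simple_value(raw)
print(raw, interval, number_of_classes(12.5, 33.5, interval))
```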

6. Position of Intervals. The position or starting-point of the
intervals is, as a rule, more or less indifferent, but in general it
is fixed either so that the limits of intervals are integers, or, as in
Tables I. and II., so that the mid-values are integers. It may,
however, be chosen, for simplicity in classification, so that no
limit corresponds exactly to any recorded value (cf. § 8 below). In
some exceptional cases, moreover, the observations exhibit a marked
clustering round certain values, e.g. tens, or tens and fives. This
is generally the case, for instance, in age returns, owing to the
tendency to state a round number where the true age is unknown.
Under such circumstances, the values round which there is a
marked tendency to cluster should preferably be made mid-values
of intervals, in order to avoid sensible error in the assumption
that the mid-value is approximately representative of the values 
in the class. Thus, in the case of ages, since the clustering is 
chiefly round tens, " 25 and under 35," "35 and under 45," etc., the 
classification of the English census, is a better grouping than " 20 
and under 30," " 30 and under 40," and so on. Where there is 
any probability of a clustering of this kind occurring, it is as well 
to subject the raw material to a close examination before finally 
fixing the classification. 

7. Classification. The scale of intervals having been fixed, the
observations may be classified. If the number of observations is 
not large, it will be sufficient to mark the limits of successive 
intervals in a column down the left-hand side of a sheet of paper, 
and transfer the entries of the original record to this sheet by 
marking a 1 on the line corresponding to any class for each entry 
assigned thereto. It saves time in subsequent totalling if each 
fifth entry in a class is marked by a diagonal across the preceding 
four, or by leaving a space. 

The disadvantage in this process is that it offers no facilities for
checking : if a repetition of the classification leads to a different 
result, there is no means of tracing the error. If the number of 
observations is at all considerable and accuracy is essential, it is 
accordingly better to enter the values observed on cards, one to 
each observation. These are then dealt out into packs according 
to their classes, and the whole work checked by running through 
the pack corresponding to each class, and verifying that no cards
have been wrongly sorted.

8. In some cases difficulties may arise in classifying, owing to
the occurrence of observed values corresponding to class-limits.
Thus, in compiling Table I., some districts will have been noted
with death-rates entered in the Registrar-General's returns as
16.5, 17.5, or 18.5, any one of which might at first sight have
been apparently assigned indifferently to either of two adjacent
classes. In such a case, however, where the original figures for
numbers of deaths and population are available, the difficulty may
be readily surmounted by working out the rate to another place
of decimals ... 15.5-16.5. Death-rates that work out to half-units
exactly do not occur in this example, and so there is no real
difficulty. In the case of Table II., again, there is no difficulty: if
the year of birth and death alone are given, the age at death is only
calculable to the nearest unit; if the actual day of birth and death be
cited, half-years still cannot occur in the age at death, because
there is an odd number of days in the year. The difficulty may
always be avoided if it be borne in mind in fixing the limits
to class-intervals, these being carried to a further place of decimals,
or a smaller fraction, than the values in the original record. Thus
if statures are measured to the nearest centimetre, the
class-intervals may be taken as 150.5-151.5, 151.5-152.5, etc.; if to
the nearest eighth of an inch, the intervals may be 59 15/16-60 15/16,
60 15/16-61 15/16, and so on.

If the difficulty is not evaded in any of these ways, it is
usual to assign one-half of an intermediate observation to each
adjacent class, with the result that half-units occur in the
class-frequencies (cf. Tables VII., p. 90, X., p. 96, and XI.,
p. 96). The procedure is rough, but probably good enough for
practical purposes; it would be slightly better, but a good deal
more laborious, to assign the intermediate observations to the
adjacent classes in proportion to the numbers of other observations
falling into the two classes.
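The halving of intermediate observations can be sketched as follows.
The function name and the sample values are hypothetical; a value
coinciding with the lowest or highest limit of the whole scale is here
assigned wholly to the end class, which is an assumption made for the
sake of a complete example.

```python
from bisect import bisect_left, bisect_right


def classify_with_half_units(values, limits):
    """Class-frequencies, halving any value that falls on an inner limit.

    limits are the ascending class-limits; class k runs from limits[k]
    to limits[k + 1].
    """
    n_classes = len(limits) - 1
    frequencies = [0.0] * n_classes
    for v in values:
        if v < limits[0] or v > limits[-1]:
            raise ValueError(f"{v} lies outside the scale of intervals")
        left = bisect_left(limits, v)    # first limit not less than v
        right = bisect_right(limits, v)  # first limit greater than v
        if left == right:
            # Strictly inside the class below limits[left].
            frequencies[left - 1] += 1
        elif left == 0:
            # On the lowest limit of the scale: whole unit to first class.
            frequencies[0] += 1
        elif left == n_classes:
            # On the highest limit of the scale: whole unit to last class.
            frequencies[n_classes - 1] += 1
        else:
            # On an inner limit: one-half to each adjacent class.
            frequencies[left - 1] += 0.5
            frequencies[left] += 0.5
    return frequencies


# Hypothetical death-rates, two of them falling on the limit 17.5.
rates = [16.0, 16.5, 17.0, 17.5, 17.5, 18.0]
print(classify_with_half_units(rates, [15.5, 16.5, 17.5, 18.5]))
```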

9. Tabulation. As regards the actual drafting of the final
table, there is little to be said, except that care should be taken
to express the class-limits clearly, and, if necessary, to state the
manner in which the difficulty of intermediate values has been
met or evaded. The class-limits are perhaps best given as in
Tables I. and II., but may be more briefly indicated by the
mid-values of the class-intervals. Thus Table I. might have been
given in the form

    Death-rate per 1000          Number of
    per annum to the             Districts with
    Nearest Unit.                said Death-rate.



A common mode of defining the class-intervals is to state the 
limits in the form " x and less than y." In the case of measure- 
ments of stature, for example, the table might run — , 



■.CH(t.lJ^le 



THEORY OF STATISTICS. 



Stature in Incites. 



57 and lesa than 6 

58 „ „ B 



the statement "57 and less than 58," etc., being often abbreviated
to 57-, 58-, 59-, etc. (cf. Table VI., p. 88). The mode of grouping
is, in effect, that described in the last paragraph as of service in
avoiding intermediate observations, but it should be noted that the
form of statement leaves the class-limits uncertain unless the degree
of accuracy of the measurements is also given. Thus, if measurements
were taken to the nearest eighth of an inch, the class-limits
are really 56 15/16-57 15/16, 57 15/16-58 15/16, etc.; if they were
only taken to the nearest quarter of an inch, the limits are
56 7/8-57 7/8, 57 7/8-58 7/8, etc. With such a form of tabulation a
statement as to the number of significant figures in the original
record is therefore essential. It is better, perhaps, to state the
true class-limits and avoid ambiguity.
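
The dependence of the true class-limits on the accuracy of measurement can be checked with a short calculation. The sketch below assumes that a reading recorded to the nearest unit u stands for any true value within u/2 of it; the function names are hypothetical.

```python
from fractions import Fraction

def true_limits(lower, width, unit):
    """True limits of the class stated as 'lower and less than lower + width'
    when readings are recorded to the nearest `unit`.

    A recorded value stands for true values within half a unit on either
    side, so both stated limits are shifted down by unit / 2.
    """
    shift = Fraction(unit) / 2
    lo = Fraction(lower) - shift
    return lo, lo + width

def mixed(q):
    """Format a positive fraction as a mixed number, e.g. '56 15/16'."""
    whole, rem = divmod(q, 1)
    return f"{whole} {rem}" if rem else f"{whole}"

eighth = true_limits(57, 1, Fraction(1, 8))
quarter = true_limits(57, 1, Fraction(1, 4))
print(" to ".join(mixed(x) for x in eighth))    # nearest eighth of an inch
print(" to ".join(mixed(x) for x in quarter))   # nearest quarter of an inch
```

The same statement "57-" thus covers two different ranges of true stature, which is why the accuracy of the record must be given with it.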

10. The rule that class-intervals should be all equal is one
that is very frequently broken in official statistical publications,
principally in order to condense an otherwise unwieldy table,
thus not only saving space in printing but also considerable
expense in compilation, or possibly, in the case of confidential
figures, to avoid giving a class which would contain only one or
two observations, the identity of which might be guessed. It
would hardly be legitimate, for example, to give a return of
incomes relating to a limited district in such a form that the
income of the two or three wealthiest men in the district would
be clear to any intelligent reader with local knowledge. If the
intervals be made unequal, the application of many statistical
methods is rendered awkward, or even impossible, and the
relative values of the frequencies are at first sight misleading, so
that the table is not perspicuous. Thus, consider the first two
columns of Table IV., showing the numbers of dwelling-houses
of different annual values, assessed to inhabited house duty. On
running the eye down the column headed "number of houses" it
is at once caught by the two striking irregularities at the classes
"£60 and under £80," and "£100 and under £150." But these
have no real significance; they are merely due to changes from
a £10 to a £20, and then to a £50 interval. Moreover, the
intervals after £150 go on continuously increasing, but attention
is not directed thereto by any marked changes in the frequencies.
To make the latter really comparable inter se, they must first be



THE FREQUENCY-DISTRIBUTION. 83

Table IV. Showing the Annual Value and Number of Dwelling-houses in
Great Britain assessed to Inhabited House Duty in 1885-6. (Cited from
Jour. Roy. Stat. Soc., vol. l., 1887, p. 810.)



Annual Value in £'s.     Number of Houses.     Frequency per £10 Interval.

 £20 and under £30
  30     „      40
  40     „      50
  50     „      60
  60     „      80
  80     „     100
 100     „     150
 150     „     300
 300     „     500
 500     „    1000
1000 and upwards

Total number of houses


SOS, 108 

182,072 
106.407 

S3,ose 
7i,ise 

32,896 
41,330 

26,732 

6,198 

2,098 

844 


306,408 

182,972 

106,407 

63,096 

86,718 

16,182 

8,267 

1,782 

810 

42 


838,692 





reduced to a common interval as basis, e.g. £10, by dividing the
fifth and sixth numbers by 2, the seventh by 5, the eighth by 15,
and so on. This gives the mean frequencies per £10 interval
tabulated in the third column of Table IV. The reduction is,
however, impossible in the case of the last class, for we are only
told the number of houses of £1000 annual value and upwards:
the magnitude of the class is indefinite. Such an indefinite class
is in many respects a great inconvenience, and should always be
avoided in work not subject to the necessary limitations of
official publications.
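
The reduction to a common £10 basis may be sketched as below. The class limits are those of Table IV.; the three house-counts used as a spot check are the entries for the three widest closed classes as they read in the table, and the function name is hypothetical.

```python
# Lower limits of the closed classes of Table IV., with the final upper limit.
limits = [20, 30, 40, 50, 60, 80, 100, 150, 300, 500, 1000]

# Width of each class, and the divisor that reduces it to a £10 basis.
widths = [hi - lo for lo, hi in zip(limits, limits[1:])]
divisors = [w // 10 for w in widths]
print(divisors)   # the fifth and sixth are 2, the seventh 5, the eighth 15

def per_basis(count, width, basis=10):
    """Mean frequency per `basis` units for a class of the given width."""
    return count * basis / width

# Spot check on the three widest closed classes (counts as read from the table).
checks = [(26732, 150), (6198, 200), (2098, 500)]
reduced = [round(per_basis(n, w)) for n, w in checks]
print(reduced)

# The class "£1000 and upwards" has no upper limit, hence no width,
# and cannot be reduced in this way.
```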

The general rule that intervals should be equal must not be 
held to bar the analysis by smaller equal intervals of some 
portion of the range over which the frequency varies very 
rapidly. In Table XII., p. 98, for example, giving the numbers 
of deaths from diphtheria at successive ages, a five-year interval 
might be substituted with advantage for the irregular intervals 
after the fifth year of age, but it would still be desirable to give
the numbers of deaths in each year for the first five years, so as
to bring out the rapid rise to the maximum in the fourth year 
of life. 

11. When the table has been completed, it is often convenient
to represent the frequency-distribution by means of a diagram
which conveys the general run of the observations to the eye
better than a column of figures. The following short table,
giving the distribution of head-breadths for 1000 men, will serve
as an example.



Table V. Showing the Frequency-distribution of Head-breadths for Students
at Cambridge. Measurements taken to the nearest tenth of an inch.
(Cited from W. R. Macdonell, Biometrika, i., 1902, p. 220.)

Head-breadth    Number of Men with    Head-breadth    Number of Men with
in Inches.      said Head-breadth.    in Inches.      said Head-breadth.

   5.5                  3                6.3                 99
   5.6                 12                6.4                 37
   5.7                 43                6.5                 15
   5.8                 80                6.6                 12
   5.9                131                6.7                  3
   6.0                236                6.8                  2
   6.1                185
   6.2                142                Total             1000



Taking a piece of squared paper ruled, say, in inches and tenths,
mark off along a horizontal base-line a scale representing class-intervals;
a half-inch to the class-interval would be suitable.
Then choose a vertical scale for the class-frequencies, say 50
observations per interval to the inch, and mark off, on the
verticals or ordinates through the points marked 5.5, 5.6, 5.7
.... at the centres of the class-intervals on the base-line, heights
representing on this scale the class-frequencies 3, 12, 43 ....
The diagram may then be completed in one of two ways: (1)
as a frequency-polygon, by joining up the marks on the verticals
by straight lines, the last points at each end being joined
down to the base at the centre of the next class-interval (fig. 1);
or (2) as a column diagram or histogram (to use a term suggested
by Professor Pearson, ref. 1), short horizontals being drawn
through the marks on the verticals (fig. 2), which now form the
central axes of a series of rectangles representing the class-frequencies.
The student should note that in any such diagram,
of either form, a certain area represents a given number of
observations. On the scales suggested, 1 inch on the horizontal
represents 2 intervals, and 1 inch on the vertical represents 50
observations per interval: 1 square inch therefore represents
50 x 2 = 100 observations. The diagrams are, however, conventional:
the whole area of the figure is correct in either case,
but the area over each interval is not correct in the case of the
frequency-polygon, and the frequency of each fraction of any



Fig. 1. Frequency-Polygon for Head-breadths of 1000 Cambridge
Students. (Table V.) (Horizontal axis: head-breadth in inches.)

Fig. 2. Histogram for the same data as Fig. 1. (Horizontal axis:
head-breadth in inches.)
interval is not the same, as suggested by the histogram. The
area shown by the frequency-polygon over any interval with an
ordinate y2 (fig. 3) is only correct if the tops of the three
ordinates y1, y2, y3 lie on a line, i.e. if y2 = ½(y1 + y3),
the areas of the two little triangles shaded in the figure being
equal. If y2 fall short of this value, the area shown by the
polygon is too great; if y2 exceed it, the area shown by the
polygon is too small; and if, for this reason, the frequency-polygon
tends to become very misleading at any part of the
range, it is better to use the histogram. In the mortality distribution
of Table I., for instance, the frequency rises so sharply
to the maximum that a histogram is, on the whole, the better
representation of the distribution of frequency, and in such a
distribution as that of Table IV. the use of the histogram is
almost imperative.
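
The statements about area may be verified numerically on the data of Table V. The sketch below is an illustration only. It takes the class-interval as the unit of breadth and supposes the polygon joined down to the base at the centre of the next interval beyond each end, as described above. Over an interval with ordinate y2 and neighbours y1 and y3 the polygon then encloses (y1 + 6 y2 + y3)/8 observations, which equals y2 only when y2 = ½(y1 + y3). The function name is hypothetical.

```python
# Class-frequencies of Table V. (head-breadths 5.5, 5.6, ..., 6.8 inches).
freq = [3, 12, 43, 80, 131, 236, 185, 142, 99, 37, 15, 12, 3, 2]

# Scales suggested in the text.
intervals_per_inch = 2             # half an inch to the class-interval
obs_per_interval_per_inch = 50     # vertical scale
obs_per_square_inch = intervals_per_inch * obs_per_interval_per_inch

def polygon_area_over(freq, i):
    """Observations shown by the frequency-polygon over class i.

    Each half of the interval is a trapezium running from the mid-height
    between two neighbouring ordinates up (or down) to y2; classes beyond
    the ends of the table count as zero.
    """
    y1 = freq[i - 1] if i > 0 else 0
    y2 = freq[i]
    y3 = freq[i + 1] if i + 1 < len(freq) else 0
    return (y1 + 6 * y2 + y3) / 8

# The interval centred on 6.0 inches, where the true frequency is 236.
shown_at_mode = polygon_area_over(freq, 5)

# Whole area: the intervals of the table plus the two small triangles
# lying beyond the first and last class-limits.
histogram_total = sum(freq)
polygon_total = (sum(polygon_area_over(freq, i) for i in range(len(freq)))
                 + (freq[0] + freq[-1]) / 8)

print(obs_per_square_inch, shown_at_mode, histogram_total, polygon_total)
```

The polygon understates the frequency at the maximum, since the ordinate there exceeds the mean of its neighbours, while the whole area agrees with the histogram.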

12. If the class-interval be made smaller and smaller, and at
the same time the number of observations be proportionately increased,
so that the class-frequencies may remain finite, the
polygon and the histogram will approach more and more closely
to a smooth curve. Such an ideal limit to the frequency-polygon
or histogram is termed a frequency-curve. In this ideal frequency-curve
the area between any two ordinates whatever is strictly
proportional to the number of observations falling between the
corresponding values of the variable. Thus the number of
observations falling between the values x1 and x2 of the variable
in fig. 4 will be proportional to the area of the shaded strip in the
figure; the number of observed values greater than x2 will
similarly be given by the area of the curve to the right of the
ordinate through x2, and so on. When, in any actual case, the
number of observations is considerable, say a thousand at least,
the run of the class-frequencies is generally sufficiently
smooth to give a good notion of the form of the ideal distribution;
with small numbers the frequencies may present all
kinds of irregularities, which, most probably, have very little
significance (cf. Chap. XV. § 15, and § 18, Ex. iv.). The forms
presented by smoothly running sets of numerous observations
present an almost endless variety, but amongst these we notice
a small number of comparatively simple types, from which many
at least of the more complex distributions may be conceived as
compounded. For elementary purposes it is sufficient to consider
these fundamental simple types as four in number, the symmetrical
distribution, the moderately asymmetrical distribution, the
extremely asymmetrical or J-shaped distribution, and the U-shaped
distribution.
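
The property that areas under the ideal curve are proportional to numbers of observations may be illustrated numerically. In the sketch below the bell-shaped curve, its centre and spread, and the total of 1000 observations are all assumptions chosen for illustration; the area is found by the trapezoidal rule.

```python
import math

def area_under(f, a, b, steps=2000):
    """Area between the ordinates at a and b, by the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return total * h

def ideal_curve(x):
    """A hypothetical symmetrical frequency-curve centred on 6.0."""
    return math.exp(-0.5 * ((x - 6.0) / 0.2) ** 2)

N = 1000                                   # assumed number of observations
whole = area_under(ideal_curve, 4.0, 8.0)  # practically the whole curve

x1, x2 = 5.8, 6.2
between = N * area_under(ideal_curve, x1, x2) / whole   # between x1 and x2
above = N * area_under(ideal_curve, x2, 8.0) / whole    # greater than x2

print(round(between, 1), round(above, 1))
```

The vertical scale of the curve is immaterial here, since only the ratio of the strip to the whole area enters the result.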

13. The symmetrical distribution, the class-frequencies decreasing
to zero symmetrically on either side of a central maximum.
Fig. 5 illustrates the ideal form of the distribution.

Being a special case of the more general type described under
the second heading, this form of distribution is comparatively rare
under any circumstances, and very exceptional indeed in economic
statistics. It occurs more frequently in the case of biometric, more
especially anthropometric, measurements, from which the following
illustrations are drawn, and is important in much theoretical work.
Table VI. shows the frequency-distribution of statures for adult
males in the British Isles, from data published by a British
Association Committee in 1883, the figures being given separately


Table VI. Showing the Frequency-distribution of Statures for Adult
Males born in England, Ireland, Scotland, and Wales. Final Report of
the Anthropometric Committee to the British Association. (Report, 1883,
p. 250.) As Measurements are stated to have been taken to the nearest
1/8th of an Inch, the Class-Intervals are here presumably 56 15/16-57 15/16,
57 15/16-58 15/16, and so on (cf. § 9). See Fig. 6.





Nnmb«r of Hen irithin Bud Limite of Heiglit. 






Place of Birth— 




Height without 
«ho«a, Inoh«e. 




Total. 










Engluid. 


SootUud. 


Wale*. 


iToUnd. 




67- 


I, 




« 1 




2 


6S- 


8 








4 


69- 


12 




1 


1 


14 


60- 


89 


2 






41 


61- 


70 


2 


9 


2 


83 


62- 


128 


» 


30 


2 


169 


68- 


820 


19 


48 


7 


891 


■ 64- 


•»624 . 


47 


83 


16 


669 


65- 


740 


10&- 


108 


38 


990' 


86- 


881 


13S 


115 


68 


1228 


67- 


918 


210 


128 


78 


1S2S 


68- 


886 


210 


72 


62 


1230 


69- 


763 


218 


52 


10 


1063 


70- 


473 


116 


33 


26 


648 


71- 


261 


102 


21 


16 


392 


72- 


117 


69 


6 


10 


202 


78- 


18 


28 


2 


8 


79 


71- 


16 


16 


1 




82 


7S- 


9* 


6 


1 




16 


76- 


. K 


4 




_ 


6 


77- 
Total 


' 1 


I 


. — 


— 


2 


61S1 


1304 


741 


316 


858S 



for persons born in England, Scotland, Wales, and Ireland, and
totalled in the last column. These frequency-distributions are
approximately of the symmetrical type. The frequency-polygon
for the totals given by the last column of the table is shown
in fig. 6. The student will notice that an error of 1/16 inch,
scarcely appreciable in the diagram on its reduced scale, is neglected
in the scale shown on the base-line, the intervals being treated
as if they were 57-58, 58-59, etc. Diagrams should be drawn for
comparison showing, to a good open scale, the separate distributions
for England, Scotland, Wales, and Ireland.



Fig. 5. An ideal symmetrical Frequency-distribution.

Fig. 6. Frequency-polygon of Stature for the totals of Table VI.
(Horizontal axis: stature in inches.)



Table VII. gives two similar distributions from more recent
investigations, relating respectively to sons over 18 years of
age, with parents living, in Great Britain, and to students at
Cambridge. The polygons are shown in figs. 7 and 8. Both these
distributions are more irregular than that of fig. 6, but, roughly
speaking, they may all be held to be approximately symmetrical.

14. The moderately asymmetrical distribution, the class-frequencies
decreasing with markedly greater rapidity on one side of
the maximum than on the other, as in fig. 9 (a) or (b). This is
the most common of all smooth forms of frequency-distribution,
illustrations occurring in statistics from almost every source. The
distribution of death-rates in the registration districts of England


Table VII. Showing the Frequency-distribution of Statures for (1) 1078
English Sons (Karl Pearson, Biometrika, ii., 1903, p. 415); (2) for 1000
Male Students at Cambridge (W. R. Macdonell, Biometrika, i., 1902,
p. 220). See Figs. 7 and 8.





Number of M 


n within said 




Limits of suture. 


Stature in 










Inches. 




Cambridge 
Studenta. 




(1) 
English Sons. 


69'5-80-B 


2-0 


_ 


60-E-Bl'E 


1-6 




61-6-e2-E 


3-5 


4-0 


B2-5-eS-6 


20 -B 


19-0 


83 -6-6* -E 


38-5 


24-6 


64-5-a6-5 


81-6 


40-5 


86'6-BS-6 


8a-E 


84-5 


86 '6-67 -6 


148-0 


128 6 


87 -6-68-6 


173-6 


189 -0 


68-6-69-B 


149-5 


179-0 


6B -6-70-6 


128-0 


138-6 


70-S-71-6 


108-0 


108-0 


71 ■5-72-6 


63-0 


58 -5 


72 -6-73 -5 


42-0 


47-6 


73-6-7* -6 


29-0 


31-0 


7* ■6-76 '5 


8-5 


12 '0 


75-6-78-B 


*-o 


6-0 


76-6-77-6 


4'0 


0-6 


77-6-78-5 


3-0 




78-5-79'B 
Total 


0-5 


— 


1078 


1000 



Fig. 7. Frequency-distribution of Stature for 1078 English Sons.
(Table VII.)

Fig. 8. Frequency-distribution of Stature for 1000 Cambridge
Students. (Table VII.) (Horizontal axis: stature in inches.)



and Wales, given in Table I., p. 77, is a somewhat rough example
of the type. The distribution of rates of pauperism in the same
districts (Table VIII. and fig. 10) is smoother and more like the
type (a) of fig. 9. The frequency attains a maximum for
districts with 2 3/4 to 3 1/4 per cent. of the population in receipt of
relief, and then tails off slowly to unions with 6, 7, and 8 per
cent. of pauperism.

Fig. 9. Ideal distributions of the moderately asymmetrical form.

Fig. 10. Frequency-distribution of Pauperism (Percentage of the Population
in Receipt of Poor-law Relief) on 1st January 1891 in the Registration
Districts of England and Wales: 632 Districts. (Table VIII.) (Horizontal
axis: percentage of the population in receipt of relief.)



Table VIII. Showing the Number of Registration Districts in England and
Wales with Different Percentages of the Population in receipt of Poor-law
Relief on the 1st January 1891. (Yule, Jour. Roy. Stat. Soc., vol. lix.,
1896, p. 347, q.v. for distributions for earlier years.) See Fig. 10.

Percentage of the Population     Number of Unions with given
in receipt of Relief.            Percentage in receipt of Relief.

0.75-1.25
1.25-1.75
1.75-2.25
2.25-2.75
2.75-3.25
3.25-3.75
3.75-4.25
4.25-4.75
4.75-5.25
5.25-5.75
5.75-6.25
6.25-6.75
6.75-7.25
7.25-7.75
7.75-8.25
8.25-8.75

Total


18 

48 
72 
SB 
100 
90 
76 
60 
40 
21 
11 


632 






While the distribution of stature is in general symmetrical, that
of weight is asymmetrical or skew, the greater frequencies lying
towards the lower end of the range. This is shown very well by
the data (Table IX. and fig. 11) collected by the same British
Association Committee, from the Report of which the data as to
stature were cited in the last section. As in the case of the stature
diagram (fig. 6), the small error of 1/2 lb. has been neglected, for
the sake of brevity, in lettering the base-line of fig. 11, the classes
being treated as if they were 90 lb.-100 lb., 100 lb.-110 lb.,
and so on.

Table X. and fig. 12 give a biological illustration, viz. the
distribution of fecundity (ratio of yearling foals produced to
coverings) in mares. The student should notice the difficulty



Fig. 11. Frequency-distribution of Weight for Adult Males. (Table IX.)
(Horizontal axis: weight in lb.)

Fig. 12. Frequency-distribution of Fecundity for Brood-mares:
2000 observations. (Table X.) (Horizontal axis: ratio of yearling
foals produced to coverings.)



Table IX. Showing the Frequency-distribution of Weights for Adult Males
born in England, Ireland, Scotland, and Wales. (Loc. cit., Table VI.)
Weights were taken to the nearest pound, consequently the true
Class-Intervals are 89.5-99.5, 99.5-109.5, etc. (§ 9).





Number of Men within aiTBn Limits of 


1 


Weight 
inlSs. 




Weight. Place of Birth- 




Total. 












England. 


Scotlasd. 


Walei. 


Iielsnd. 






C') 


{^) 


i^) 


t^/ 


iO 


so- 


\ ■' 

2 








2 


loes 


26 


1 


2 


6 


34 


110- 


138 


8 


10 




152 


120- 


388 


22 


23 


7 


890 


130- 


694 


88 


88 


42 


8B7 


140- 


1240 


173 


153 


67 


1S28 


160- 


1076 


2G6 


178 


61 


1669 


i«0- 


881 


27G 


134 


36 


1326 


170- 


492 


168 


102 


25 


787 


180- 


304 


126 


34 


13 


476 


IBO- 


174 


87 


li 




283 


200- 


76 


24 


7 




107 


210- 


62 


14 


8 


1 


86 


220- 


33 


7 






41 


230- 


10 




2 




16 


340- 


B 


2 






11 


2B0- 


3 


4 






8 


280- 












270- 












280- 
Total 


— 


— 


1 


— 


1 


&662 


1212 


7SS 


247 


7749 



of classification in this case: the class-interval chosen throughout
the middle of the range is 1/15th, but the last interval is
"29/30-1." This is not a whole interval, but it is more than a
half, for all the cases of complete fecundity are reckoned into the
class. In the diagram (fig. 12) it has been reckoned as a whole
class, and this gives a smooth distribution.

To take an illustration from meteorology, the distribution of
barometer heights at any one station over a period of time is, in
general, asymmetrical, the most frequent heights lying towards the
upper end of the range for stations in England and Wales.
Table XI. and fig. 13 show the distribution for daily observations
at Southampton during the years 1878-90 inclusive.

The distributions of Tables VIII.-XI. all follow more or less the
type of fig. 9 (a), the frequency tailing off, at the steeper end of



Table X. Showing the Frequency-distribution of Fecundity, i.e. the Ratio
of the Number of Yearling Foals produced to the Number of Coverings,
for Brood-mares (Race-horses) Covered Eight Times at Least. (Pearson,
Lee, and Moore, Phil. Trans., A, vol. cxcii. (1899), p. SOS.) See Fig. 12.









Number of i 




Mares with 




Marairith 


FecQudit;. 


Fecundity 
between the 


Fecundity. 


Fecundity 










Given Limita. 




Given Limits. 


1/80- 3/80 


2 


17/30-19/30 


316 










5/80- 7/80 


11-6 




293-6 


7/80- 8/30 


21-6 


23/30-25/30 


20* 


8/80-11/80 


es 


26/30-27/80 


127 


11/80-18/80 


104-6 


27/30-29/30 


49 


13/80-16/30 


1S2 


29/30-1 


18 





















Table XI. Showing the Frequency-distribution of Barometer Heights for
Daily Observations during the Thirteen Years 1878-1890 at Southampton.
(Karl Pearson and A. Lee, Phil. Trans., A, vol. cxc. (1897), p. 428, q.v.
for other distributions.) See Fig. 13.





Namber of Days 
on which Height 




Number of Daye 


Height of 


Height of 


on which Height 
wu observed 






Barometer 


in Inches. 


between the 


iu luchee. 






Given Limits. 




Given Limite. 


28 -46-38 '6G 


1 


29*85- -95 


548-6 




Sft- 


W 


2 




J6-30-06 


602-6 




Ih- 


li> 


2 


S(l 


•)!.- 


15 


619-5 




7^- 


Nh 


4 




15- 


^h 


500 




'■h- 


»h 


8-5 




V5- 


<6 


382 




llV-Vfl 


16 


18-6 




.tfi- 


46 






^T- 


IB 


21-6 




4R- 


Sh 






IR- 


2B 


37 




55- 


S5 


88-5 




■M.- 


«h 


79 




rth- 


/.^> 


43-6 




■M.- 


4h 


108 




Vh- 


H5 


7 




4h- 


bh 


181-6 




Hft- 


»6 


4 




M>- 


116 


2.14-6 


30-95-31 05 






66- 
76- 


7.'^ 
85 


348-5 
468-6 








Total 


4748 



Fig. 13. Frequency-distribution of Barometer Heights at Southampton.
(Table XI.)

Fig. 14. Frequency-distribution of Deaths from Diphtheria at different Ages
in England and Wales, 1891-1900. (Table XII.) (Horizontal axis: years of age.)



the distribution, in such a way as to suggest that the ideal
curve is tangential to the base. Cases of greater asymmetry,
suggesting an ideal curve that meets the base (at one end) at a
finite angle, even a right angle, as in fig. 9 (b), are less frequent,
but occur occasionally. The distribution of deaths from diphtheria,
according to age, affords one such example of a more asymmetrical
kind. The actual figures for this case are given in Table XII., and
illustrated by fig. 14; and it will be seen that the frequency of
deaths reaches a maximum for children aged "3 and under 4,"
the number rising very rapidly to the maximum, and thence
falling so slowly that there is still an appreciable frequency for
persons over 60 or 70 years of age.

Table XII. Showing the Numbers of Deaths from Diphtheria at Different
Ages in England and Wales during the Ten Years 1891-1900. (Supplement
to 65th Annual Report of the Registrar-General, 1891-1900, p. 8.)
See Fig. 14.

Age in Years.        Number of Deaths between     Number
                     Given Limits of Age.         per Annum.

Under 1 year                 4,186                  4,186
 1-                         10,491                 10,491
 2-                         11,218                 11,218
 3-                         12,390                 12,390
 4-                         11,194                 11,194
 5-                         23,348                  4,670
10-                          4,092                    818
15-                          1,123                    225
20-                            585                    117
25-                            785                     78
35-                            512                     51
45-                            324                     32
55-                            260                     26
65-                            127                     13
75 and upwards                  36                      -

Total                       80,671                      -



15. The extremely asymmetrical, or "J-shaped," distribution, the
class-frequencies running up to a maximum at one end of the
range, as in fig. 15.

Fig. 15. An ideal Distribution of the extreme Asymmetrical Form.

This may be regarded as the extreme form of the last distribution,
from which it cannot always be distinguished by elementary
methods if the original data are not available. If, for instance,
the frequencies of Table XII. had been given by five-year intervals
only, they would have run 49,479, 23,348, 4,092, and so on,
thus suggesting a maximum number of deaths at the beginning
of life, i.e. a distribution of the present type. It is only the
analysis of the deaths in the earlier years of life by one-year
intervals which shows that the frequency reaches a true maximum
in the fourth year, and therefore the distribution is of the
moderately asymmetrical type. In practical cases no hard and
fast line can always be drawn between the moderately and
extremely asymmetrical types, any more than between the
moderately asymmetrical and the symmetrical type.
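
The effect of coarse grouping on the apparent position of the maximum can be reproduced from the opening classes of Table XII. The sketch below is illustrative only: the figures are those of the table as read from the scan (the first five sum to the 49,479 quoted above), and the function names are hypothetical.

```python
# (lower age, upper age, deaths) for the classes of Table XII. up to age 15.
rows = [
    (0, 1, 4186), (1, 2, 10491), (2, 3, 11218), (3, 4, 12390), (4, 5, 11194),
    (5, 10, 23348), (10, 15, 4092),
]

def per_year(rows):
    """Mean number of deaths per year of age within each class."""
    return [(lo, hi, n / (hi - lo)) for lo, hi, n in rows]

def regroup(rows, width):
    """Totals by equal intervals of `width` years.

    Valid only when every original class lies wholly inside one of the
    new intervals; otherwise a ValueError is raised.
    """
    totals = {}
    for lo, hi, n in rows:
        start = (lo // width) * width
        if hi > start + width:
            raise ValueError("class straddles two of the new intervals")
        totals[start] = totals.get(start, 0) + n
    return sorted(totals.items())

five_year = regroup(rows, 5)
fine = per_year(rows)

modal_coarse = max(five_year, key=lambda r: r[1])[0]   # lower limit of fullest class
modal_fine = max(fine, key=lambda r: r[2])[:2]         # limits of fullest class

print(five_year)
print(modal_coarse, modal_fine)
```

With five-year intervals the greatest frequency falls in the first class, beginning at birth; with the one-year analysis it falls in the class "3 and under 4."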

In economic statistics this form of distribution is particularly
characteristic of the distribution of wealth in the population at
large, as illustrated, e.g., by income tax and house valuation returns,
by returns of the size of agricultural holdings, and so on (cf. ref. 4).
The distributions may possibly be a very extreme case of the last
type; but if the maximum is not absolutely at the lower end of the
range, it is very close indeed thereto. Official returns do not
usually give the necessary analysis of the frequencies at the
lower end of the range to enable the exact position of the maximum
to be determined; and for this reason the data on which Table
XIII. is founded, though of course very unreliable, are of some
interest. It will be seen from the table and fig. 16 that with the
given classification the distribution appears clearly assignable to
the present type, the number of estates between zero and £100
in annual value being more than six times as great as the number
between £100 and £200 in annual value, and the frequency
continuously falling as the value increases. A close analysis of
the first class suggests, however, that the greatest frequency does
not occur actually at zero, but that there is a true maximum
frequency for estates of about £1 15 in annual value. The
distribution might therefore be more correctly assigned to the
second type, but the position of the greatest frequency indicates a


Table XIII. Showing the Numbers and Annual Values of the Estates of
those who had taken part in the Jacobite Rising of 1715. (Compiled from
Cosin's Names of the Roman Catholics, Nonjurors, and others who refused
to take the Oaths to his late Majesty King George, etc.; London, 1745.
Figures of very doubtful absolute value. See a note in Southey's
Commonplace Book, vol. i. p. 57S, quoted from the Memoirs of T. Hollis.)
See Fig. 16.



Annoal 
V.lM in 
;E100. 


Number of 


Annual 

Value in 
£100. 


Number of 
Estates. 


0- 1 


1726-6 


lT-^18 


1 


1- 2 


280 






2- 8 


liO'G 


20-21 




8- 4 


87 


21-22 


1 


1- C 


48'5 


22-23 




6-6 


*2-6 


23-24 


1 . 


6- 7 


29-6 






7- 8 


26-5 


27-28 


2 


S- 9 


18-6 






9-10 


21 


81-32 


1 


10-11 


11-& 






11-12 


S-6 


3S-40 




12-13 








18-1* 


!-6 


46-46 


1 


14-16 


8 






16-16 


8 


48^49 


1 


19-17 


6 






Total 


2476 








degree of asymmetry that is high even compared with the
asymmetry of fig. 14: the distribution of numbers of deaths from
diphtheria would more closely resemble the distribution of estate-values
if the maximum occurred in the fourth and fifth weeks
of life instead of in the fourth year. The figures of Table IV.,
p. 83, showing the annual value and number of dwelling-houses,
afford a good illustration of this form of distribution, but marred
by the unequal intervals so common in official returns.

Fig. 16. Frequency-distribution of the Annual Values of Estates.
(Table XIII.)



Table XIV. Showing the Frequency-distributions of the Number of Petals
for Three Series of Ranunculus bulbosus. (... for details.) See Fig. 17.



Number 
of Petals. 




Freqneuoj. 




Series A. 


Series B. 


Series C. 


10 
11 

Total 


8ia 

17 
i 
2 
2 


346 

J 

2 

2 


18S 

56 
23 
7 
2 
2 


S37 


sso 


222 



The type is not very frequent in other classes of material, but
instances occur here and there. Table XIV. and fig. 17 show
distributions of this form for the petals of the buttercup, Ranunculus
bulbosus.

16. The U-shaped distribution, exhibiting a maximum frequency
at the ends of the range and a minimum towards the centre.
The ideal form of the distribution is illustrated by fig. 18.

Fig. 18. An ideal Distribution of the U-shaped Form.

This is a rare but interesting form of distribution, as it stands
in somewhat marked contrast to the preceding forms. Table XV.
and fig. 19 illustrate an example based on a considerable number
of observations, viz. the distribution of degrees of cloudiness, or
estimated percentage of the sky covered by cloud, at Breslau





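Table XV. Frequency-distribution of Degrees of Cloudiness at Breslau,
1876-85. See Fig. 19.

Cloudiness.    Frequency.    Cloudiness.    Frequency.

    0              751            6              21
    1              179            7              71
    2              107            8             194
    3               69            9             117
    4               46           10            2089
    5                9
                                 Total         3653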



during the years 1876-85. A sky completely, or almost completely,
overcast at the time of observation is the most common,
a practically clear sky comes next, and intermediates are more
rare.

This form of distribution appears to be sometimes exhibited by
the percentages of offspring possessing a certain attribute when one
at least of the parents also possesses the attribute. The remarks



Fig. 19. Frequency-distribution of Degrees of Cloudiness at Breslau,
1876-85: 3653 observations. (Table XV.) (Horizontal axis: cloudiness.)

of Sir Francis Galton in Natural Inheritance suggest such a
form for the distribution of "consumptivity" amongst the offspring
of consumptives, but the figures are not in a decisive shape.
Table XVI. gives the distribution for an analogous case, viz. the

Table XVI. Showing the Percentages of Deaf-mutes among Children of
Parents one of whom at least was a Deaf-mute, for Marriages producing
Five Children or more. (Compiled from material in Marriages of the Deaf
in America, ed. E. A. Fay, Volta Bureau, Washington, 1898.)



Peroent^ 
Dear-mut«s. 


Number of 
Families. 


Peroentaga 

of 
Deaf-mutes. 


Number of 
Families, 


0-20 
20-*0 
10-60 


220 
20-6 
12 


60-80 
80-100 

Total 


6-6 
16 


m 



distribution of deaf-mutism amongst the offspring of parents one
of whom at least was a deaf-mute. In general less than one-fifth
of the children are deaf-mutes: at the other end of the range the
cases in which over 80 per cent. of the children are deaf-mutes are
nearly three times as many as those in which the percentage lies
between 60 and 80. The numbers are, however, too small to form
a very satisfactory illustration.

REFERENCES.

(1) Pearson, Karl, "Skew Variation in Homogeneous Material," Phil.
Trans. Roy. Soc., Series A, vol. clxxxvi. (1895), pp. 343-414.
(2) Pearson, Karl, "Cloudiness: Note on a Novel Case of Frequency,"
Proc. Roy. Soc., vol. lxii. (1897), p. 287.
(3) Pearson, Karl, "Supplement to a Memoir on Skew Variation," Phil.
Trans. Roy. Soc., Series A, vol. cxcvii. (1901), pp. 443-459.
(4) Pareto, Vilfredo, Cours d'économie politique; 2 vols., Lausanne,
1896-7. See especially tome ii., livre iii., chap. i., "La courbe des ...

The first three memoirs above are mathematical memoirs ...
of ideal frequency-curves, the first being the fundamental ...
the second and third supplementary. The elementary student may,
however, refer to them with advantage, on account of the large collection
of frequency-distributions which is given, and from which some of the
illustrations in the preceding chapter have been cited. Without
attempting to follow the mathematics, he may also note that each of ...
rough empirical types may be divided into several sub-types, the ...


1. If the diagram fig. 9 is redrawn to scales of 300 observations per
interval to the inch and 4 inches of stature to the inch, what is the
scale of observations to the square inch?

If the scales are 100 observations per interval to the centimetre and
2 inches of stature to the centimetre, what is the scale of observations
to the square centimetre?

2. If fig. 10 is redrawn to scales of 25 observations per interval to
the inch and 2 per cent. to the inch, what is the scale of observations
to the square inch?

If the scales are 10 observations per interval to the centimetre and
1 per cent. to the centimetre, what is the scale of observations to the
square centimetre?

3. If a frequency-polygon be drawn to represent the data of Table I.,
what number of observations will the polygon show between death-rates
of 16.5 and 17.5 per thousand, instead of the true number [illegible]?

4. If a frequency-polygon be drawn to represent the data of Table V.,
what number of observations will the polygon show between head-breadths
5.95 and 6.05, instead of the true number 236?






CHAPTER VII.
AVERAGES.

1. Necessity for quantitative definition of the characters of a
frequency-distribution; 2. Measures of position (averages) and of
dispersion; 3. The dimensions of an average the same as those of the
variable; 4. Desirable properties for an average to possess; 5. The
commoner forms of average; 6-13. The arithmetic mean: its definition,
calculation, and simpler properties; 14-18. The median: its definition,
calculation, and simpler properties; 19-20. The mode: its definition
and relation to mean and median; 21. Summary comparison of the
preceding forms of average; 22-26. The geometric mean: its definition,
simpler properties, and the cases in which it is specially applicable;
27. The harmonic mean: its definition and calculation.

1. In § 2 of the last chapter it was pointed out that a classification
of the observations in any long series is the first step necessary
to make the observations comprehensible, and to render possible
those comparisons with other series which are necessary for any
discussion of causation. Very little experience, however, would
show that classification alone is not an adequate method, seeing
that it only enables qualitative or verbal comparisons to be made.
The next step that it is desirable to take is the quantitative
definition of the characters of the frequency-distribution, so that
quantitative comparisons may be made between the corresponding
characters of two or more series. It might seem at first sight
that very difficult cases of comparison could arise in which, for
example, we had to contrast a symmetrical distribution with a
"J-shaped" distribution. As a matter of practice, however, we seldom
have to deal with such a case; distributions drawn from similar
material are, in general, of similar form. When we have to
compare the frequency-distributions of stature in two races of
man, of the death-rates in English registration districts in two
successive decades, of the numbers of petals in two races of the
same species of Ranunculus, we have only to compare with each
other two distributions of the same or nearly the same type.

2. Confining our attention, then, to this simple case, there are
two fundamental characteristics in which such distributions may
differ: (1) they may differ markedly in position, i.e. in the values
of the variable round which they centre, as in fig. 20, A, or (2)
they may centre round the same value, but differ in the range of
variation or dispersion, as it is termed, as in fig. 20, B. Of course
the distributions may differ in both characters at once, as in fig. 20,
C, but the two properties may be considered independently.
Measures of the first character, position, are generally known as
averages; measures of the second are termed measures of dispersion.
In addition to these two principal and fundamental
characters, we may also take a third of some interest but of much
less importance, viz. the degree of asymmetry of the distribution.

The present chapter deals only with averages; measures of
dispersion are considered in Chapter VIII. and measures of
asymmetry are also briefly discussed at the end of that chapter.

3. In whatever way an average is defined, it may be as well to
note, it is merely a certain value of the variable, and is therefore
necessarily of the same dimensions as the variable: i.e. if the
variable be a length, its average is a length; if the variable be a
percentage, its average is a percentage, and so on. But there are
several different ways of approximately defining the position of a
frequency-distribution, that is, there are several different forms of
average, and the question therefore arises, By what criteria are we
to judge the relative merits of different forms? What are, in fact,
the desirable properties for an average to possess?



4. (a) In the first place, it almost goes without saying that an
average should be rigidly defined, and not left to the mere estimation
of the observer. An average that was merely estimated would
depend too largely on the observer as well as the data. (b) An
average should be based on all the observations made. If not,
it is not really a characteristic of the whole distribution. (c) It
is desirable that the average should possess some simple and
obvious properties to render its general nature readily
comprehensible: an average should not be of too abstract a mathematical
character. (d) It is, of course, desirable that an average should
be calculated with reasonable ease and rapidity. Other things
being equal, the easier calculated is the better of two forms of
average. At the same time too great weight must not be attached
to mere ease of calculation, to the neglect of other factors. (e)
It is desirable that the average should be as little affected as
may be possible by what we have termed fluctuations of sampling.
If different samples be drawn from the same material, however
carefully they may be taken, the averages of the different samples
will rarely be quite the same, but one form of average may show
much greater differences than another. Of the two forms, the
more stable is the better. The full discussion of this condition
must, however, be postponed to a later section of this work
(Chap. XVII.). (f) Finally, by far the most important desideratum
is this, that the measure chosen shall lend itself readily to
algebraical treatment. If, e.g., two or more series of observations
on similar material are given, the average of the combined series
should be readily expressed in terms of the averages of the
component series: if a variable may be expressed as the sum of
two or more others, the average of the whole should be readily
expressed in terms of the averages of its parts. A measure for
which simple relations of this kind cannot be readily determined
is likely to prove of somewhat limited application.

5. There are three forms of average in common use, the
arithmetic mean, the median, and the mode, the first named being
by far the most widely used in general statistical work. To
these may be added the geometric mean and the harmonic mean,
more rarely used, but of service in special cases. We will
consider these in the order named.

6. The arithmetic mean. The arithmetic mean of a series of
values of a variable X_1, X_2, X_3, ... X_N, N in number, is the
quotient of the sum of the values by their number. That is to
say, if M be the arithmetic mean,

\[ M = \frac{1}{N}\,(X_1 + X_2 + X_3 + \dots + X_N), \]

or, to express it more briefly by using the symbol Σ to denote
"the sum of all quantities like,"

\[ M = \frac{1}{N}\,\Sigma(X) \qquad (1) \]

The word mean or average alone, without qualification, is very
generally used to denote this particular form of average: that
is to say, when anyone speaks of "the mean" or "the average"
of a series of observations, it may, as a rule, be assumed that the
arithmetic mean is meant. It is evident that the arithmetic
mean fulfils the conditions laid down in (a) and (b) of § 4, for it
is rigidly defined and based on all the observations made.
Further, it fulfils condition (c), for its general nature is readily
comprehensible. If the wages-bill for N workmen is £P, the
arithmetic mean wage, P/N pounds, is the amount that each
would receive if the whole sum available were divided equally
between them: conversely, if we are told that the mean wage
is £M, we know this means that the wages-bill is N.M pounds.
Similarly, if N families possess a total of C children, the mean
number of children per family is C/N, the number that each
family would possess if the children were shared uniformly.
Conversely, if the mean number of children per family is M, the
total number of children in N families is N.M. The arithmetic
mean expresses, in fact, a simple relation between the whole
and its parts.

7. As regards simplicity of calculation, the mean takes a high
position. In the cases just cited, it will be noted that the mean
is actually determined without even the necessity of determining
or noting all the individual values of the variable: to get the
mean wage we need not know the wages of every hand, but only
the wages-bill; to get the mean number of children per family
we need not know the number in each family, but only the total.
If this total is not given, but we have to deal with a moderate
number of observations, so few (say 30 or 40) that it is hardly
worth while compiling the frequency-distribution, the arithmetic
mean is calculated directly as suggested by the definition, i.e.
all the values observed are added together and the total divided
by the number of observations. But if the number of observations
be large, this direct process becomes a little lengthy. It may
be shortened considerably by forming the frequency-table and
treating all the values in each class as if they were identical with
the mid-value of the class-interval, a process which in general
gives an approximation that is quite sufficiently exact for
practical purposes if the class-interval has been taken moderately
small (cf. Chap. VI. § 5). In this process each class-frequency
is multiplied by the mid-value of the interval, the products added
together, and the total divided by the number of observations.
If f denote the frequency of any class, X the mid-value of the
corresponding class-interval, the value of the mean so obtained
may be written

\[ M = \frac{1}{N}\,\Sigma(f.X) \qquad (2) \]

8. But this procedure is still further abbreviated in practice
by the following artifices: (1) The class-interval is treated
as the unit of measurement throughout the arithmetic; (2) the
difference between the mean and the mid-value of some
arbitrarily chosen class-interval is computed instead of the absolute
value of the mean.

Let A be the arbitrarily chosen value and let

\[ \xi = X - A \qquad (3) \]

Then

\[ \Sigma(f.X) = \Sigma(f.A) + \Sigma(f.\xi), \]

or, since A is a constant,

\[ M = A + \frac{1}{N}\,\Sigma(f.\xi) \qquad (4) \]

The calculation of Σ(f.X) is therefore replaced by the calculation
of Σ(f.ξ). The advantage of this is that the class-frequencies
need only be multiplied by small integral numbers; for A
being the mid-value of a class-interval, and X the mid-value of
another, and the class-interval being treated as a unit, the ξ's
must be a series of integers proceeding from zero at the arbitrary
origin A. To keep the values of ξ as small as possible, A should
be chosen near the middle of the range.

It may be mentioned here that Σ(ξ), or Σ(f.ξ) for the grouped
distribution, is sometimes termed the first moment of the
distribution about the arbitrary origin A: we shall not, however, make
use of this term.

9. The process is illustrated by the following example, using
the frequency-distribution of Table VIII., Chap. VI. The
arbitrary origin A is taken at 3.5 per cent., the middle of the
sixth class-interval from the top of the table, and a little nearer
than the middle of the range to the estimated position of the
mean. The consequent values of ξ are then written down as in
column (3) of the table, against the corresponding frequencies, the
values starting, of course, from zero opposite 3.5 per cent. Each
frequency f is then multiplied by its ξ and the products entered
in another column (4). The negative and positive products are
totalled separately, giving totals -776 and +509 respectively,
whence Σ(f.ξ) = -267. Dividing this by N, viz. 632, we have
the difference of M from A in class-intervals, viz. -0.42 intervals,
that is -0.21 per cent. Hence the mean is 3.5 - 0.21 = 3.29
per cent.

CALCULATION OF THE MEAN: Example I. Calculation of the Arithmetic
Mean of the Percentages of the Population in receipt of Relief, from the
Figures of Table VIII., Chap. VI., p. 93.

       (1)              (2)             (3)              (4)
   Mid-value of      Frequency     Deviation from      Product
  class-interval         f        arbitrary value A      f.ξ
  (percentage in                         ξ
 receipt of relief)

        1                18             - 5             -  90
        1.5              48             - 4             - 192
        2                72             - 3             - 216
        2.5              89             - 2             - 178
        3               100             - 1             - 100
        3.5              90               0                 0
        4                75             + 1             +  75
        4.5              60             + 2             + 120
        5                40             + 3             + 120
        5.5              21             + 4             +  84
        6                11             + 5             +  55
        6.5               5             + 6             +  30
        7                 1             + 7             +   7
        7.5               1             + 8             +   8
        8                 0             + 9                 0
        8.5               1             +10             +  10

      Total             632       negative products     - 776
                                  positive products     + 509

\[ \Sigma(f.\xi) = +509 - 776 = -267 \]

\[ M - A = -\frac{267}{632}\ \text{class-intervals} = -0.42\ \text{class-intervals} = -0.21\ \text{units} \]

\[ \therefore\ \text{mean}\ M = 3.5 - 0.21 = 3.29\ \text{per cent.} \]

It must always be remembered that Σ(f.ξ)/N gives the value of
M - A in class-intervals, and must not be added directly to A
unless the interval is also a unit. In the present illustration the
interval is half a unit, and accordingly the quotient is
halved in order to obtain an answer in units. Care must also be
taken to give the right sign to the quotient.
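The arithmetic of Example I. may be retraced mechanically. The following is only a minimal sketch in Python, using the mid-values and frequencies as read from the table above; the function names `grouped_mean` and `shortcut_mean` are illustrative and form no part of the text. It computes the mean once by the direct process of equation (2) and once by the abbreviated process of equation (4), converting the quotient from class-intervals to units before adding it to A.

```python
# Mid-values (per cent.) and class-frequencies as read from Example I.
mid_values = [1.0 + 0.5 * i for i in range(16)]      # 1, 1.5, ..., 8.5
freqs = [18, 48, 72, 89, 100, 90, 75, 60, 40, 21, 11, 5, 1, 1, 0, 1]


def grouped_mean(mids, fs):
    """Direct process, equation (2): M = (1/N) * sum of f.X."""
    n = sum(fs)
    return sum(f * x for f, x in zip(fs, mids)) / n


def shortcut_mean(mids, fs, origin_index):
    """Abbreviated process, equation (4), deviations in class-intervals."""
    n = sum(fs)
    width = mids[1] - mids[0]          # assumes equal class-intervals
    a = mids[origin_index]             # the arbitrary origin A
    sum_f_xi = sum(f * (i - origin_index) for i, f in enumerate(fs))
    # The quotient is in class-intervals; multiply by the width for units.
    return a + width * sum_f_xi / n, sum_f_xi


mean_direct = grouped_mean(mid_values, freqs)
mean_short, sum_f_xi = shortcut_mean(mid_values, freqs, origin_index=5)  # A = 3.5

print(sum(freqs), sum_f_xi)                          # 632 -267
print(round(mean_direct, 2), round(mean_short, 2))   # 3.29 3.29
```

The two processes agree, as they must, since the second is merely a rearrangement of the first.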

10. As the process is an important one we give a second illustration
from the figures of Table VI., Chap. VI. In this case the
class-interval is a unit (1 inch), so the value of M - A is given directly
by dividing Σ(f.ξ) by N. The student must notice that, measures
having been made to the nearest eighth of an inch, the mid-values
of the intervals are 57 7/16, 58 7/16, etc., and not 57.5, 58.5, etc.

CALCULATION OF THE MEAN: Example II. Calculation of the Arithmetic
Mean Stature of Male Adults in the British Isles from the Figures of
Chap. VI., Table VI., p. 88.

       (1)              (2)             (3)              (4)
     Height,         Frequency     Deviation from      Product
     inches              f        arbitrary value A      f.ξ
                                         ξ

       57-                2             -10             -   20
       58-                4             - 9             -   36
       59-               14             - 8             -  112
       60-               41             - 7             -  287
       61-               83             - 6             -  498
       62-              169             - 5             -  845
       63-              394             - 4             - 1576
       64-              669             - 3             - 2007
       65-              990             - 2             - 1980
       66-             1223             - 1             - 1223
       67-             1329               0                  0
       68-             1230             + 1             + 1230
       69-             1063             + 2             + 2126
       70-              646             + 3             + 1938
       71-              392             + 4             + 1568
       72-              202             + 5             + 1010
       73-               79             + 6             +  474
       74-               32             + 7             +  224
       75-               16             + 8             +  128
       76-                5             + 9             +   45
       77-                2             +10             +   20

      Total            8585       negative products     - 8584
                                  positive products     + 8763

\[ \Sigma(f.\xi) = +8763 - 8584 = +179 \]

\[ M - A = +\frac{179}{8585} = +0.02\ \text{class-intervals or inches} \]

\[ \therefore\ M = 67\tfrac{7}{16} + 0.02 = 67.46\ \text{inches} \]

It is evident that an absolute check on the arithmetic of any
such calculation may be effected by taking a different arbitrary
origin for the deviations; all the figures of col. (4) will be changed,
but the value ultimately obtained for the mean must be the
same. The student should note that a classification by unequal
intervals is, at best, a hindrance to this simple form of calculation,
and the use of an indefinite interval for the extremity of the
distribution renders the exact calculation of the mean impossible
(cf. Chap. VI. § 10).
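The check by change of origin may be tried on the figures of Example II. The sketch below (Python, with exact fractions so that the comparison is free from rounding) takes the class-frequencies as read from the table and the mid-values 57 7/16, 58 7/16, etc.; the helper `mean_from_origin` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction

# Class-frequencies for the classes 57-, 58-, ..., 77- inches (Example II).
freqs = [2, 4, 14, 41, 83, 169, 394, 669, 990, 1223, 1329,
         1230, 1063, 646, 392, 202, 79, 32, 16, 5, 2]
# Mid-values 57 7/16, 58 7/16, ...; the class-interval is 1 inch.
mids = [Fraction(57 * 16 + 7, 16) + k for k in range(len(freqs))]


def mean_from_origin(origin_index):
    """Equation (4) with the arbitrary origin A = mids[origin_index]."""
    n = sum(freqs)
    sum_f_xi = sum(f * (k - origin_index) for k, f in enumerate(freqs))
    return mids[origin_index] + Fraction(sum_f_xi, n), sum_f_xi


mean_a, total_a = mean_from_origin(10)   # A = 67 7/16, as in the text
mean_b, total_b = mean_from_origin(8)    # a different arbitrary origin

print(total_a, total_b)                  # the totals of col. (4) differ
print(float(mean_a), float(mean_b))      # the means are identical
```

Every figure of col. (4) alters with the origin, but the mean comes out the same, about 67.46 inches.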

11. We return again below (§ 13) to the question of the
errors caused by the assumption that all values within the same
interval may be treated as approximately the mid-value of the
interval. It is sufficient to say here that the error is in general
very small and of uncertain sign for a distribution of the
symmetrical or only moderately asymmetrical type, provided of
course the class-interval is not large (Chap. VI. § 5). In the case
of the "J-shaped" or extremely asymmetrical distribution,
however, the error is evidently of definite sign, for in all the intervals
the frequency is piled up at the limit lying towards the greatest
frequency, i.e. the lower end of the range in the case of the
illustrations given in Chap. VI., and is not evenly distributed over the
interval. In distributions of such a type the intervals must be
made very small indeed to secure an approximately accurate value
for the mean. The student should test for himself the effect of
different groupings in two or three different cases, so as to get
some idea of the degree of inaccuracy to be expected.

Fig. 21. Showing the Arithmetic Mean M, the Median Mi, and the Mode Mo,
by verticals drawn through the corresponding points on the base, for the
distribution of pauperism of fig. 10, p. 92. (Horizontal axis: percentage
of the population in receipt of relief.)
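The test recommended to the student may be sketched as follows. The data here are not from the text: as a stand-in for a "J-shaped" distribution the sketch assumes 10,000 evenly spaced quantiles of the exponential law, with the frequency piled up at the lower end of the range, and `grouped_mean` is an illustrative name. Each value is replaced by the mid-value of its class-interval, and the resulting mean is compared with the mean of the ungrouped values for several widths of interval.

```python
import math

n = 10000
# Assumed J-shaped material: evenly spaced quantiles of the exponential law.
values = [-math.log(1 - (i + 0.5) / n) for i in range(n)]
true_mean = sum(values) / n


def grouped_mean(xs, width):
    """Treat every value as if it lay at the mid-value of its class-interval."""
    mids = [(math.floor(x / width) + 0.5) * width for x in xs]
    return sum(mids) / len(mids)


# Error of the grouped mean for successively smaller class-intervals.
errors = {w: grouped_mean(values, w) - true_mean for w in (2.0, 1.0, 0.5, 0.25)}
for w, e in errors.items():
    print(f"class-interval {w:>4}: error of grouped mean = {e:+.4f}")
```

On these assumed data the error has the same sign for every grouping, the grouped mean being always too great, and it diminishes only as the interval is made small.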

12. If a diagram has been drawn representing the frequency-distribution,
the position of the mean may conveniently be
indicated by a vertical through the corresponding point on the
base. Thus fig. 21 (a reproduction of fig. 10) shows the
frequency-polygon for our first illustration, and the vertical MM indicates
the mean. In a moderately asymmetrical distribution at all of
this form the mean lies, as in the present example, on the side of
the greatest frequency towards the longer "tail" of the distribution:
M in fig. 22 shows similarly the position of the mean in
an ideal distribution. In a symmetrical distribution the mean
coincides with the centre of symmetry. The student should mark
the position of the mean in the diagram of every frequency-distribution
that he draws, and so accustom himself to thinking of
the mean, not as an abstraction, but always in relation to the
frequency-distribution of the variable concerned.

Fig. 22. Mean M, Median Mi, and Mode Mo, of the ideal moderately
asymmetrical distribution.

13. The following examples give important properties of the
arithmetic mean, and at the same time illustrate the facility of its
algebraic treatment:

(a) The sum of the deviations from the mean, taken with their
proper signs, is zero.

This follows at once from equation (4): for if M and A are
identical, evidently Σ(f.ξ) must be zero.



(b) If a series of N observations of a variable X consist of, say,
two component series, the mean of the whole series can be
readily expressed in terms of the means of the two components.
For if we denote the values in the first series by X_1 and in the
second series by X_2,

\[ \Sigma(X) = \Sigma(X_1) + \Sigma(X_2), \]

that is, if there be N_1 observations in the first series and N_2 in
the second, and the means of the two series be M_1, M_2 respectively,

\[ N.M = N_1.M_1 + N_2.M_2 \qquad (5) \]

For example, we find from the data of Table VI., Chap. VI.,

Mean stature of the 346 men born in Ireland = 67.78 in.
Mean stature of the 741 men born in Wales   = 66.62 in.

Hence the mean stature of the 1087 men born in the two countries
is given by the equation

\[ 1087.M = (346 \times 67.78) + (741 \times 66.62). \]

That is, M = 66.99 inches. It is evident that the form of the
relation (5) is quite general; if there are r series of observations
X_1, X_2 ... X_r, the mean M of the whole series is related to
the means M_1, M_2 ... M_r of the component series by the
equation

\[ N.M = N_1.M_1 + N_2.M_2 + \dots + N_r.M_r \qquad (6) \]

For the convenient checking of arithmetic, it is useful to note
that, if the same arbitrary origin A for the deviations ξ be taken
in each case, we must have, denoting the component series by the
subscripts 1, 2, ... r as before,

\[ \Sigma(f.\xi) = \Sigma(f_1.\xi_1) + \Sigma(f_2.\xi_2) + \dots + \Sigma(f_r.\xi_r) \qquad (7) \]

The agreement of these totals accordingly checks the work.

As an important corollary to the general relation (6), it may
be noted that the approximate value for the mean obtained from
any frequency-distribution is the same whether we assume (1)
that all the values in any class are identical with the mid-value
of the class-interval, or (2) that the mean of the values in the
class is identical with the mid-value of the class-interval.

(c) The mean of all the sums or differences of corresponding
observations in two series (of equal numbers of observations) is
equal to the sum or difference of the means of the two series.

This follows almost at once. For if

\[ X = X_1 \pm X_2, \]

\[ \Sigma(X) = \Sigma(X_1) \pm \Sigma(X_2). \]

That is, if M, M_1, M_2 be the respective means,

\[ M = M_1 \pm M_2 \qquad (8) \]

Evidently the form of this result is again quite general, so that
if

\[ X = X_1 \pm X_2 \pm \dots \pm X_r, \]

\[ M = M_1 \pm M_2 \pm \dots \pm M_r \qquad (9) \]

As a useful illustration of equation (8), consider the case of
measurements of any kind that are subject (as indeed all
measures must be) to greater or less errors. The actual
measurement X in any such case is the algebraic sum of the true
measurement X_1 and an error X_2. The mean of the actual
measurements M is therefore the sum of the true mean M_1 and
the arithmetic mean of the errors M_2. If, and only if, the
latter be zero, will the observed mean be identical with the true
mean. Errors of grouping (§ 11) are a case in point.
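The three properties (a), (b) and (c) may each be verified numerically. In the sketch below (Python, exact fractions) only the figures for the men born in Ireland and Wales are taken from the text; the values `xs`, `true_values` and `errors` are hypothetical and are chosen merely to illustrate the relations.

```python
from fractions import Fraction as F
from statistics import mean

# (a) Deviations from the mean sum to zero (hypothetical values).
xs = [F(3), F(7), F(8), F(12)]
m = mean(xs)
sum_of_deviations = sum(x - m for x in xs)

# (b) Equation (6) with the figures quoted in the text:
# 346 men born in Ireland (67.78 in.) and 741 born in Wales (66.62 in.).
counts = [346, 741]
means = [F("67.78"), F("66.62")]
combined = sum(n * mu for n, mu in zip(counts, means)) / sum(counts)

# (c) Equation (8): observed value = true value + error (hypothetical).
true_values = [F(12), F(15), F(11), F(18), F(14)]
errors = [F(1, 2), F(-1, 4), F(0), F(-1, 2), F(1, 4)]   # mean error is zero
observed = [t + e for t, e in zip(true_values, errors)]

print(sum_of_deviations)                      # 0
print(round(float(combined), 2))              # 66.99
print(mean(observed), mean(true_values) + mean(errors))
```

Since the assumed errors have a mean of zero, the observed mean here coincides with the true mean; were every error increased by the same amount, the observed mean would be increased by exactly that amount.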

14. The median. The median may be defined as the middlemost
or central value of the variable when the values are ranged
in order of magnitude, or as the value such that greater and
smaller values occur with equal frequency. In the case of a
frequency-curve, the median may be defined as that value of the
variable the vertical through which divides the area of the curve
into two equal parts, as the vertical through Mi in fig. 22.

The median, like the mean, fulfils the conditions (b) and (c)
of § 4, seeing that it is based on all the observations made, and
that it possesses the simple property of being the central or
middlemost value, so that its nature is obvious. But the
definition does not necessarily lead in all cases to a determinate value.
If there be an odd number of different values of X observed, say
2n+1, the (n+1)th in order of magnitude is the only value
fulfilling the definition. But if there be an even number, say
2n different values, any value between the nth and (n+1)th
fulfils the conditions. In such a case it appears to be usual to
take the mean of the nth and (n+1)th values as the median,
but this is a convention supplementary to the definition. It
should also be noted that in the case of a discontinuous variable
the second form of the definition in general breaks down: if we
range the values in order there is always a middlemost value
(provided the number of observations be odd), but there is not, as a
rule, any value such that greater and less values occur with equal
frequency. Thus in Table III., § 3 of Chap. VI., we see that 45 per
cent. of the poppy capsules had 12 or fewer stigmatic rays, 55
per cent. had 13 or more; similarly 61 per cent. had 13 or fewer
rays, 39 per cent. had 14 or more. There is no number of rays
such that the frequencies in excess and defect are equal.
In the case of the buttercups of Table XIV. (Chap. VI. § 15)
there is no number of petals that even remotely fulfils the
required condition. An analogous difficulty may arise, it may
be remarked, even in the case of an odd number of observations
of a continuous variable if the number of observations be small
and several of the observed values identical. The median is
therefore a form of average of most uncertain meaning in cases
of strictly discontinuous variation, for it may be exceeded by
5, 10, 15, or 20 per cent. only of the observed values, instead of
by 50 per cent.: its use in such cases is to be deprecated, and
is perhaps best avoided in any case, whether the variation be
continuous or discontinuous, in which small series of observations
have to be dealt with.
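The first form of the definition, with the supplementary convention for an even number of values, may be put into a few lines. This is a sketch only: the functions `median` and `shares` are written for the purpose, and the petal counts are hypothetical, not those of Table XIV. The second function shows how far, for a discontinuous variable, the proportions below and above the median may depart from equality.

```python
def median(values):
    """Middlemost value; for an even number, the mean of the two central values."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("no observations")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                      # the (n+1)th of 2n+1 values
    return (xs[mid - 1] + xs[mid]) / 2      # convention for 2n values


def shares(values, m):
    """Proportions of the observations falling below and above the value m."""
    n = len(values)
    below = sum(1 for v in values if v < m) / n
    above = sum(1 for v in values if v > m) / n
    return below, above


# Hypothetical counts of petals: a strictly discontinuous variable.
petals = [5] * 9 + [6, 7]
med = median(petals)
print(med, shares(petals, med))
```

With these assumed counts the median is 5 petals, yet no observation falls below it and only two in eleven exceed it, which is the difficulty described above.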
15. When a table showing the frequency-distribution for a
long series of observations of a continuous variable is given, no
difficulty arises, as a sufficiently approximate value of the median
can be readily determined by simple interpolation on the
hypothesis that the values in each class are uniformly distributed
throughout the interval. Thus, taking the figures in our first
illustration of the method of calculating the mean, the total
number of observations (registration districts) is 632, of which
the half is 316. Looking down the table, we see that there are
227 districts with not more than 2.75 per cent. of the population
in receipt of relief, and 100 more with between 2.75 and 3.25
per cent. But only 89 are required to make up the total of 316;
hence the value of the median is taken as

\[ 2.75 + \frac{89}{100} \times \frac{1}{2} = 2.75 + 0.445 = 3.195\ \text{per cent.} \]

The mean being 3.29, the median is slightly less; its position
is indicated by Mi in fig. 21.

The value of the median stature of males may be similarly
calculated from the data of the second illustration. The work
may be indicated thus:

Half the total number of observations (8585) = 4292.5
Total frequency under 66 15/16 inches        = 3589
                                   Difference =  703.5
Frequency in next interval                   = 1329

\[ \text{Therefore median} = 66\tfrac{15}{16} + \frac{703.5}{1329} = 67.47\ \text{inches.} \]

The difference between median and mean in this case is
therefore only about one-hundredth of an inch, the smallness
of the difference arising from the approximate symmetry of
the distribution. In an absolutely symmetrical distribution
it is evident that mean and median must coincide.
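The interpolation may be sketched as below, on the same hypothesis of uniform distribution within the interval. The function name `interpolated_median` is illustrative; the frequencies and class-limits are those of the two worked illustrations (limits 0.75, 1.25, ... per cent. for the first, and 56 15/16, 57 15/16, ... inches for the second).

```python
def interpolated_median(lower_limits, width, freqs):
    """Median by simple interpolation within the class containing N/2."""
    if not any(freqs):
        raise ValueError("no observations")
    half = sum(freqs) / 2
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= half:
            # Values assumed uniformly spread over the class-interval.
            return lower + width * (half - cumulative) / f
        cumulative += f


# Example I: pauperism, class-interval one-half per cent.
relief_freqs = [18, 48, 72, 89, 100, 90, 75, 60, 40, 21, 11, 5, 1, 1, 0, 1]
relief_limits = [0.75 + 0.5 * i for i in range(len(relief_freqs))]
median_relief = interpolated_median(relief_limits, 0.5, relief_freqs)

# Example II: stature, class-interval one inch, lower limits 56 15/16, ...
stature_freqs = [2, 4, 14, 41, 83, 169, 394, 669, 990, 1223, 1329,
                 1230, 1063, 646, 392, 202, 79, 32, 16, 5, 2]
stature_limits = [56.9375 + k for k in range(len(stature_freqs))]
median_stature = interpolated_median(stature_limits, 1.0, stature_freqs)

print(round(median_relief, 3), round(median_stature, 2))   # 3.195 67.47
```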

16. Graphical interpolation may, if desired, be substituted
for arithmetical interpolation. Taking, again, the figures of
Example I., the number of districts with pauperism not exceeding
2.25 is 138; not exceeding 2.75, 227; not exceeding 3.25, 327;
and not exceeding 3.75, 417. Plot the numbers of districts
with pauperism not exceeding each value X to the corresponding
value of X on squared paper, to a good large scale, as in fig. 23,
and draw a smooth curve through the points thus obtained,
preferably with the aid of one of the "curves," splines, or flexible
curves sold by instrument-makers for the purpose. The point
in which the smooth curve so obtained cuts the horizontal line
corresponding to a total frequency N/2 = 316 gives the median.
In general the curve is so flat that the value obtained by this
graphical method does not differ appreciably from that calculated
arithmetically (the arithmetical process assuming that the
curve is a straight line between the points on either side of
the median); if the curvature is considerable, the graphical
value (assuming, of course, careful and accurate draughtsmanship)
is to be preferred to the arithmetical value, as it does not
involve the crude assumption that the frequency is uniformly
distributed over the interval in which the median lies.

Fig. 23. Determination of the median by graphical interpolation.

17. A comparison of the calculations for the mean and
for the median respectively will show that on the score of
brevity of calculation the median has a distinct advantage.
When, however, the ease of algebraical treatment of the two
forms of average is compared, the superiority lies wholly on
the side of the mean. As was shown in § 13, when several series
of observations are combined into a single series, the mean of
the resultant distribution can be simply expressed in terms
of the means of the components. The expression of the
median of the resultant distribution in terms of the medians
of the components is, however, not merely complex and difficult,
but impossible: the value of the resultant median depends on
the forms of the component distributions, and not on their
medians alone. If two symmetrical distributions of the same
form and with the same numbers of observations, but with
different medians, be combined, the resultant median must
evidently (from symmetry) coincide with the resultant mean, i.e.
lie halfway between the means of the components. But if the
two components be asymmetrical, or (whatever their form)
if the degrees of dispersion or numbers of observations in the
two series be different, the resultant median will not coincide
with the resultant mean, nor with any other simply assignable
value. It is impossible, therefore, to give any theorem for
medians analogous to equations (5) and (6) for means. It is
equally impossible to give any theorem analogous to equations
(8) and (9) of § 13. The median of the sum or difference of
pairs of corresponding observations in two series is not,
in general, equal to the sum or difference of the medians of
the two series; the median value of a measurement subject to
error is not necessarily identical with the true median, even
if the median error be zero, i.e. if positive and negative errors
be equally frequent.
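A small numerical case may make the impossibility plain. In the sketch below the two pairs of series are hypothetical, chosen so that the component medians (2 and 5) and the numbers of observations are the same in both pairs; the helper `summary` is an illustrative name. The combined median nevertheless differs between the pairs, whereas the combined mean follows in each case from the component means alone.

```python
from statistics import mean, median

# Two hypothetical pairs of series with the same component medians, 2 and 5.
pair_1 = ([1, 2, 3], [4, 5, 6])
pair_2 = ([0, 2, 10], [4.5, 5, 5.5])


def summary(a, b):
    """Component and combined medians and means for two series."""
    return {
        "component medians": (median(a), median(b)),
        "combined median": median(a + b),
        "component means": (mean(a), mean(b)),
        "combined mean": mean(a + b),
    }


s1 = summary(*pair_1)
s2 = summary(*pair_2)
print(s1)
print(s2)
```

In the first pair the combined median is 3.5, in the second 4.75, although the component medians are identical; the combined means, 3.5 and 4.5, are in each case simply the means of the component means, the numbers of observations being equal.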

18. These limitations render the applications of the median in
any work in which theoretical considerations are necessary
comparatively circumscribed. On the other hand, the median may
have an advantage over the mean for special reasons. (a) It is
very readily calculated; a factor to which, however, as already
stated, too much weight ought not to be attached. (b) It is
readily obtained, without the necessity of measuring all the
objects to be observed, in any case in which they can be arranged
by eye in order of magnitude. If, for instance, a number of men
be ranked in order of stature, the stature of the middlemost is
the median, and he alone need be measured. (On the other hand
it is useless in the cases cited at the end of § 6; the median wage
cannot be found from the total of the wages-bill, and the total
of the wages-bill is not known when the median is given.) (c) It
is sometimes useful as a makeshift, when the observations are so
given that the calculation of the mean is impossible, owing, e.g., to
a final indefinite class, as in Table IV. (Chap. VI. § 10). (d) The
median may sometimes be preferable to the mean, owing to its
being less affected by abnormally large or small values of the
variable. The stature of a giant would have no more influence
on the median stature of a number of men than the stature of
any other man whose height is only just greater than the median.
If a number of men enjoy incomes closely clustering round a
median of £500 a year, the median will be no more affected by
the addition to the group of a man with the income of £50,000
than by the addition of a man with an income of £5000, or even
£600. If observations of any kind are liable to present occasional
greatly outlying values of this sort (whether real, or due to
errors or blunders), the median will be more stable and less
affected by fluctuations of sampling than the arithmetic mean.
(In general the mean is the less affected.) The point is discussed
more fully later (Chap. XVII.).
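
The stability of the median under outlying values may be illustrated numerically. The following is a minimal sketch only: the eleven incomes are hypothetical figures invented for the illustration (they do not come from any table in the text), and the helper name `with_extra` is likewise an assumption. One further income of £600, £5000 or £50,000 is added in turn, and the resulting median and arithmetic mean are compared.

```python
from statistics import mean, median

# Hypothetical incomes (pounds a year) clustering round a median of 500.
incomes = [470, 480, 485, 490, 495, 500, 505, 510, 515, 520, 530]

def with_extra(values, extra):
    """Return (median, mean) of the group after one further income is added."""
    enlarged = values + [extra]
    return median(enlarged), mean(enlarged)

for extra in (600, 5000, 50000):
    med, avg = with_extra(incomes, extra)
    print(f"added {extra:>6}: median = {med}, mean = {avg:.1f}")
```

In this sketch the median is the same whichever of the three incomes is added, since each lies above the middle of the group, while the mean is carried upwards in proportion to the size of the outlying value.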

19. The Mode. — The mode is the value of the variable corre- 
sponding to the maximum of the ideal frequency-curve which 
gives the closest possible fit to the actual distribution.

It is evident that in an ideal symmetrical distribution mean,
median and mode coincide with the centre of symmetry. If,
however, the distribution be asymmetrical, as in fig. 22, the three
forms of average are distinct, Mo being the mode, Mi the median,
and M the mean. Clearly, the mode is an important form of
average in the cases of skew distributions, though the term is of
recent introduction (Pearson; ref. 11). It represents the value
which is most frequent or typical, the value which is in fact the
fashion (la mode). But a difficulty at once arises on attempting
to determine this value for such distributions as occur in practice.
It is no use giving merely the mid-value of the class-interval into 
which the greatest frequency falls, for this is entirely dependent 
on the choice of the scale of class-intervals. It is no use making 
the class-intervals very small to avoid error on that account, for 
the class-frequencies will then become small and the distribution 
irregular. What we want to arrive at is the mid-value of the 
interval for which the frequency would be a maximum, if the 
intervals could be made indefinitely small and at the same time 
the number of observations be so increased that the class-frequen- 
cies should run smoothly. As the observations cannot, in a 
practical case, be indefinitely increased, it is evident that some
process of smoothing out the irregularities that occur in the
actual distribution must be adopted, in order to ascertain the
approximate value of the mode. But there is only one smoothing
process that is really satisfactory, in so far as every observation
can be taken into account in the determination, and that is the
method of fitting an ideal frequency-curve of given equation to
the actual figures. The value of the variable corresponding to the
maximum of the fitted curve is then taken as the mode, in
accordance with our definition. Mo in fig. 21 is the value of the
mode so determined for the distribution of pauperism, the value
2.99 being, as it happens, very nearly coincident with the centre
of the interval in which the greatest frequency lies. The deter- 
mination of the mode by this — the only strictly satisfactory — 
method must, however, be left to the more advanced student. 

20. At the same time there is an approximate relation between 
mean, median, and mode that appears to hold good with surprising 
closeness for moderately asymmetrical distributions, approaching 
the ideal type of fig. 9, and it is one that should be borne in 
mind as giving — roughly, at all events — the relative values of 
these three averages for a great many cases with which the 
student will have to deal. It is expressed by the equation

Mode = Mean - 3(Mean - Median).

That is to say, the median lies one-third of the distance from the 
mean towards the mode (compare figs. 21 and 22). For the 
distribution of pauperism we have, taking the mean to three
places of decimals,

Mean . . . . . 3.289
Median . . . . 3.195

Hence approximate mode = 3.289 - 3 × 0.094
                       = 3.007,

or 3.01 to the second place of decimals, which is sufficient accuracy
for the final result, though three decimal places must be retained
for the calculation. The true mode, found by fitting an ideal
distribution, is 2.99. As further illustrations of the closeness
with which the relation may be expected to hold in different cases,
we give below the results for the distributions of pauperism in
the unions of England and Wales in the years 1850, 1860, 1870,
1881, and 1891 (the last being the illustration taken above),
and also the results for the distribution of barometer heights at
five stations.



Comparison of the Approximate and True Modes in the Case of Five
Distributions of Pauperism (Percentages of the Population in receipt of
Relief) in the Unions of England and Wales. (Yule, Jour. Roy. Stat.
Soc., vol. lix., 1896.)

Year.    Mean.    Median.    Approximate Mode.    True Mode.
1850     6.508    6.261      5.767                [illegible]
1860     5.195    5.000      4.610                4.667
1870     5.451    5.380      5.238                5.038
1881     3.676    3.523      3.217                3.240
1891     3.289    3.195      3.007                2.987



Comparison of the Approximate and True Modes in the Case of Five
Distributions of the Height of the Barometer for Daily Observations at the
Stations named. (Distributions given by Karl Pearson and Alice Lee,
Phil. Trans., A, vol. cxc. (1897), p. 423.)

Station.       Mean.     Median.    Approximate Mode.    True Mode.
[illegible]    29.981    30.000     30.038               30.039
[illegible]    29.891    29.915     29.963               29.960
Carmarthen     29.952    29.974     30.018               30.013
Glasgow        29.886    29.906     29.946               29.967
Dundee         29.870    29.890     29.930               29.951



It will be seen that in the case of the pauperism figures the
approximate mode only diverges markedly from the true value
in the year 1870, a year in which the frequency-distribution was
very irregular. In all the other years the difference between the
true and approximate values of the mode is hardly greater than
the alteration that might be caused in the true mode itself by
slight variations in the method of fitting the curve to the actual
distribution. Similar remarks apply to the second series of
illustrations; the true and approximate values are extremely close,
except in the case of Dundee and Glasgow, where the divergence
reaches two-hundredths of an inch.
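
The approximate relation of § 20 is easily put into a form suitable for machine calculation. The sketch below is offered only as an illustration: the function name `approximate_mode` is an assumption of the example, and the figures used are the mean and median of the 1891 pauperism distribution as given in the text.

```python
def approximate_mode(mean, median):
    """Approximate mode from the relation Mode = Mean - 3(Mean - Median)."""
    return mean - 3 * (mean - median)

# Pauperism distribution of 1891, as quoted in the text: mean 3.289, median 3.195.
pauperism_1891 = approximate_mode(3.289, 3.195)
print(round(pauperism_1891, 3))  # 3.007
```

The relation is empirical, and the result should be read as a rough value for moderately asymmetrical distributions only, not as a substitute for fitting a frequency-curve.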

21. Summing up the preceding paragraphs, we may say that
the mean is the form of average to use for all general purposes;
it is simply calculated, its value is always determinate, its
algebraic treatment is particularly easy, and in most cases it is
rather less affected than the median by errors of sampling. The
median is, it is true, somewhat more easily calculated from a given
frequency-distribution than is the mean; it is sometimes a useful
makeshift, and in a certain class of cases it is more and not less
stable than the mean; but its use is undesirable in cases of
discontinuous variation, its value may be indeterminate, and its algebraic
treatment is difficult and often impossible. The mode, finally,
is a form of average hardly suitable for elementary use, owing
to the difficulty of its determination, but at the same time it
represents an important value of the variable. The arithmetic
mean should invariably be employed unless there is some very
definite reason for the choice of another form of average, and the
elementary student will do very well if he limits himself to its
use. Objection is sometimes taken to the use of the mean in the
case of asymmetrical frequency-distributions, on the ground that
the mean is not the mode, and that its value is consequently
misleading. But no one in the least degree familiar with the
manifold forms taken by frequency-distributions would regard the
two as in general identical, and while the importance of the mode
is a good reason for stating its value in addition to that of the
mean, it cannot replace the latter. The objection, it may be
noted, would apply with almost equal force to the median, for, as
we have seen (§ 20), the difference between mode and median
is usually about two-thirds of the difference between mode and
mean.

22. The Geometric Mean.—The geometric mean G of a series of
values $X_1, X_2, X_3, \ldots, X_N$ is defined by the relation

$$G = (X_1 \cdot X_2 \cdot X_3 \cdots X_N)^{1/N} \qquad (10)$$

The definition may also be expressed in terms of logarithms,

$$\log G = \frac{1}{N}\,\Sigma(\log X) \qquad (11)$$

that is to say, the logarithm of the geometric mean of a series of 
values is the arithmetic mean of their logarithms. 
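
The logarithmic form of the definition lends itself directly to computation. The following is a minimal sketch, assuming positive values only; the function name `geometric_mean` and the three sample values are assumptions of the illustration, not taken from the text.

```python
import math

def geometric_mean(values):
    """Geometric mean by equation (11): log G is the arithmetic mean of log X."""
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

sample = [2.0, 4.0, 8.0]                  # illustrative values only
g = geometric_mean(sample)                # cube root of 2 * 4 * 8 = 64
arithmetic = sum(sample) / len(sample)    # for comparison
```

Working through the logarithms, as here, avoids forming the product of a long series of values, which may easily exceed the range of the machine.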

The geometric mean of a given series of quantities is always
less than their arithmetic mean (unless the quantities are all
equal, when the two coincide); the student will find a proof in
most text-books of algebra, and in ref. 10. The magnitude of
the difference depends largely on the amount of dispersion of the
variable in proportion to the magnitude of the mean (cf. Chap.
VIII., Question 8). The geometric mean is necessarily zero, it
should be noticed, if even a single value of X is zero, and it may
become imaginary if negative values occur. Excluding these cases,
the value of the geometric mean is always determinate and is
rigidly defined. The computation is a little long, owing to the
necessity of taking logarithms: it is hardly necessary to give an
example, as the method is simply that of finding the arithmetic
mean of the logarithms of X (instead of the values of X) in
accordance with equation (11). If there are many observations, a
table should be drawn up giving the frequency-distribution of
log X, and the mean should be calculated as in Examples i. and ii.
of §§ 9 and 10. The geometric mean has never come into general
use as a representative average, partly, no doubt, on account of its
rather troublesome computation, but principally on account of its
somewhat abstract mathematical character (cf. § 4 (c)); the geometric
mean does not possess any simple and obvious properties which
render its general nature readily comprehensible.

23. At the same time, as the following examples show, the
geometric mean possesses some important properties, and is readily
treated algebraically in certain cases.

(a) If the series of observations X consist of r component
series, there being $N_1$ observations in the first, $N_2$ in the second,
and so on, the geometric mean G of the whole series can be
readily expressed in terms of the geometric means $G_1$, $G_2$, etc., of
the component series. For evidently we have at once (as in § 13)

$$N \log G = N_1 \log G_1 + N_2 \log G_2 + \cdots + N_r \log G_r \qquad (12)$$

(b) The geometric mean of the ratios of corresponding observations
in two series is equal to the ratio of their geometric means.
For if

$$X = X_1 / X_2,$$
$$\log X = \log X_1 - \log X_2,$$

then, summing for all pairs of $X_1$'s and $X_2$'s,

$$G = G_1 / G_2 \qquad (13)$$

(c) Similarly, if a variable X is given as the product of any
number of others, i.e. if

$$X = X_1 \cdot X_2 \cdot X_3 \cdots X_r,$$

$X_1, X_2, \ldots, X_r$ denoting corresponding observations in r
different series, the geometric mean G of X is expressed in terms
of the geometric means $G_1, G_2, \ldots, G_r$ of $X_1, X_2, \ldots, X_r$ by
the relation

$$G = G_1 \cdot G_2 \cdot G_3 \cdots G_r \qquad (14)$$

That is to say, the geometric mean of the product is the product
of the geometric means.
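
The three properties (a), (b) and (c) may be checked numerically. The sketch below is illustrative only: the two series `x1` and `x2` are hypothetical positive observations invented for the purpose, and the helper `geometric_mean` is an assumption of the example.

```python
import math

def geometric_mean(values):
    """Geometric mean computed through the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Two hypothetical series of corresponding positive observations.
x1 = [12.0, 15.0, 9.0, 20.0]
x2 = [3.0, 5.0, 4.5, 8.0]

# (a), equation (12): N log G = N1 log G1 + N2 log G2 for the pooled series.
pooled = x1 + x2
lhs_12 = len(pooled) * math.log(geometric_mean(pooled))
rhs_12 = (len(x1) * math.log(geometric_mean(x1))
          + len(x2) * math.log(geometric_mean(x2)))

# (b), equation (13): geometric mean of the ratios = ratio of the geometric means.
gm_of_ratios = geometric_mean([a / b for a, b in zip(x1, x2)])
ratio_of_gms = geometric_mean(x1) / geometric_mean(x2)

# (c), equation (14): geometric mean of the products = product of the geometric means.
gm_of_products = geometric_mean([a * b for a, b in zip(x1, x2)])
product_of_gms = geometric_mean(x1) * geometric_mean(x2)
```

Each pair of quantities agrees to within the rounding error of the machine arithmetic; the agreement follows from the additivity of logarithms and does not depend on the particular figures chosen.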



24. The use of the geometric mean finds its simplest application
in estimating the numbers of a population midway between two
epochs (say two census years) at which the population is known.
If nothing is known concerning the increase of the population
save that the numbers recorded at the first census were $P_0$ and at
the second census n years later $P_n$, the most reasonable assumption
to make is that the percentage increase in each year has
been the same, so that the populations in successive years form a
geometric series, $P_0 r$ being the population a year after the first
census, $P_0 r^2$ two years after the first census, and so on, and

$$P_n = P_0\, r^n \qquad (15)$$

The population midway between the two censuses is therefore

$$P_{n/2} = P_0\, r^{n/2} = (P_0 P_n)^{1/2} \qquad (16)$$



i.e. the geometric mean of the numbers given by the two censuses.
This result must, however, be used with discretion. The rate of
increase of population is not necessarily, or even usually, constant
over any considerable period of time: if it were so, a curve
representing the growth of population as in fig. 24 would be
continuously convex to the base, whether the population were
increasing or decreasing. In the diagram it will be seen that
the curves are frequently concave towards the base, and similar
results will often be found for districts in which the population is
not increasing very rapidly, and from which there is much
emigration. Further, the assumption is not self-consistent in any
case in which the rate of increase is not uniform over the entire
area, and almost any area can be analysed into parts which are not
similar in this respect. For if in one part of the area considered
the initial population is $P_0$ and the common ratio R, and in the
remainder of the area the initial population is $p_0$ and the common
ratio r, the population in year n is given by

$$P_0 R^n + p_0 r^n \qquad (17)$$

This does not represent a constant rate of increase unless R = r.
If then, for example, a constant percentage rate of increase be
assumed for England and Wales as a whole, it cannot be assumed
for the Counties: if it be assumed for the Counties, it cannot be
assumed for the country as a whole. The student is referred to
refs. 14, 15 for a discussion of methods actually used for the
consistent estimation of populations under such circumstances.
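
The want of self-consistency just described may be seen in a small numerical case. The sketch below is an illustration under stated assumptions: the two parts of the area and their census figures are hypothetical, and the function name `midway_population` is invented for the example. The estimate for the whole area by equation (16) is compared with the sum of the estimates made separately for the parts.

```python
import math

def midway_population(p_start, p_end):
    """Population midway between two censuses, assuming a constant
    percentage rate of increase: the geometric mean of equation (16)."""
    return math.sqrt(p_start * p_end)

# Hypothetical area in two parts: (first census, second census).
part_a = (100_000, 121_000)   # increasing
part_b = (200_000, 200_000)   # stationary

whole_start = part_a[0] + part_b[0]
whole_end = part_a[1] + part_b[1]

estimate_whole = midway_population(whole_start, whole_end)
estimate_parts = midway_population(*part_a) + midway_population(*part_b)

print(round(estimate_whole), round(estimate_parts))
```

With these figures the estimate for the area as a whole exceeds the sum of the estimates for its parts by some hundreds of persons, so that the two assumptions cannot both be held at once.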

25. The property of the geometric mean illustrated by equation
(13) renders it, in some respects, a peculiarly convenient form of
average in dealing with ratios, i.e. "index-numbers," as they are
termed, of prices. Let

$$X'_0,\; X''_0,\; X'''_0,\; \ldots,\; X^{(N)}_0$$
$$X'_1,\; X''_1,\; X'''_1,\; \ldots,\; X^{(N)}_1$$
$$X'_2,\; X''_2,\; X'''_2,\; \ldots,\; X^{(N)}_2$$

denote the prices of N commodities in the years 0, 1, 2, ....
Further, let $Y'_{10} = X'_1 / X'_0$, and so on, so that

$$Y'_{10},\; Y''_{10},\; Y'''_{10},\; \ldots,\; Y^{(N)}_{10}$$
$$Y'_{20},\; Y''_{20},\; Y'''_{20},\; \ldots,\; Y^{(N)}_{20}$$

represent the ratios of the prices of the several commodities in years
1, 2, ... to their prices in year 0. These ratios, in practice
multiplied by 100, are termed index-numbers of the prices of the
several commodities, on the year 0 as base. Evidently some
form of average of the Y's for any given year will afford an
indication of the general level of prices for that year, provided the
commodities chosen are sufficiently numerous and representative.
The question is, what form of average to choose. If the geometric
mean be chosen, and $G_{10}$, $G_{20}$ denote the geometric means of the
Y's for the years 1 and 2 respectively, we have

$$\frac{G_{20}}{G_{10}} = \left(\frac{Y'_{20}}{Y'_{10}} \cdot \frac{Y''_{20}}{Y''_{10}} \cdot \frac{Y'''_{20}}{Y'''_{10}} \cdots \frac{Y^{(N)}_{20}}{Y^{(N)}_{10}}\right)^{1/N}$$
$$= \left(\frac{X'_2}{X'_1} \cdot \frac{X''_2}{X''_1} \cdot \frac{X'''_2}{X'''_1} \cdots \frac{X^{(N)}_2}{X^{(N)}_1}\right)^{1/N}$$
$$= \left(Y'_{21} \cdot Y''_{21} \cdot Y'''_{21} \cdots Y^{(N)}_{21}\right)^{1/N}$$

From the first form of this equation we see that the ratio of the
geometric mean index-number in year 2 to that in year 1 is
identical with the geometric mean of the ratios for the
index-numbers of the several commodities. A similar property does
not hold for any other form of average: the ratio of the arithmetic
mean index-numbers is not the same as the arithmetic mean of
the ratios, nor is the ratio of the medians the median of the
ratios. From the second and third forms of the equation it
appears further that the ratio of the geometric mean
index-number in year 2 to that in year 1 is independent of the prices in
the year first chosen as base (i.e. year 0), and is identical with the
geometric mean of the index-numbers for year 2, on year 1 as
base. Again, a similar property does not hold for any other form
of average. If arithmetic means of the index-numbers be taken,
for example, the ratio of the mean in year 2 to the mean in year
1 will vary with the year taken as base, and will differ more or
less from the arithmetic mean ratio of the prices in year 2 to the
prices of the same commodities in year 1; the same statement is
true if medians be used. The results given by the use of the
geometric mean possess, therefore, a certain consistency that is
not exhibited if other forms of average are employed. It was
used in a classical paper by Jevons (ref. 4), though not on quite
the same grounds, but has never been at all generally employed.
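
The consistency of the geometric mean index-number, and the want of it in the arithmetic mean, may be exhibited on a small scale. The following is a sketch only: the prices of three commodities in years 0, 1 and 2 are hypothetical, and the names `index_numbers` and `ratio_of_means` are assumptions of the illustration.

```python
import math

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def arithmetic_mean(values):
    return sum(values) / len(values)

# Hypothetical prices of three commodities in the years 0, 1 and 2.
prices = {
    0: [10.0, 40.0, 5.0],
    1: [12.0, 38.0, 8.0],
    2: [15.0, 30.0, 9.0],
}

def index_numbers(year, base):
    """Ratios of the price of each commodity in `year` to its price in `base`."""
    return [p / b for p, b in zip(prices[year], prices[base])]

def ratio_of_means(average, base):
    """Ratio of the averaged index-number for year 2 to that for year 1,
    both index-numbers being formed on the year `base`."""
    return average(index_numbers(2, base)) / average(index_numbers(1, base))

# Average of the index-numbers for year 2 formed directly on year 1 as base.
direct_gm = geometric_mean(index_numbers(2, 1))
direct_am = arithmetic_mean(index_numbers(2, 1))

gm_on_base_0 = ratio_of_means(geometric_mean, 0)
am_on_base_0 = ratio_of_means(arithmetic_mean, 0)
```

In this sketch `gm_on_base_0` agrees with `direct_gm`, whereas `am_on_base_0` (1.08 for these figures) differs from `direct_am`, in accordance with the statement of the text that the property holds for the geometric mean alone.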

26. The general use of the geometric mean has been suggested
on another ground, namely, that the magnitudes of deviations
appear, as a rule, to be dependent in some degree on the
magnitude of the average; thus the length of a mouse varies less than
the stature of a man, and the height of a shrub less than that of
a tree. Hence, it is argued, variations in such cases should be
measured rather by their ratio to, than their difference from, the
average; and if this is done, the geometric mean is the natural
average to use. If deviations be measured in this way, a
deviation G/r will be regarded as the equivalent of a deviation r.G,
instead of a deviation -x as the equivalent of a deviation +x.
If a distribution take the simplest possible form when relative
deviations are regarded as equivalents, the frequency of deviations
between G/s and G/r will be equal to the frequency of deviations
between r.G and s.G. The frequency-curve will then be
symmetrical round log G if plotted to log X as base, and if there be
a single mode, log G will be that mode, a logarithmic or geometric
mode, as it might be termed: G will not be the mode if the
distribution be plotted in the ordinary way to values of X as base.
The theory of such a distribution has been discussed by more than
one author (refs. 2, 8, 9). The general applicability of the
assumption made does not, however, appear to have been very widely
tested, and the reasons assigned have not sufficed to bring the
geometric mean into common use. It may be noted that, as the
geometric mean is always less than the arithmetic mean, the
fundamental assumption which would justify the use of the former
clearly does not hold where the (arithmetic) mode is greater than
the arithmetic mean, as in Tables X. and XI. of the last chapter.

27. The Harmonic Mean.—The harmonic mean of a series of
quantities is the reciprocal of the arithmetic mean of their
reciprocals, that is, if H be the harmonic mean,

$$\frac{1}{H} = \frac{1}{N}\,\Sigma\!\left(\frac{1}{X}\right) \qquad (18)$$

The following illustration, the result of which is required for an
example in a later chapter (Chap. XIII. § 11), will serve to show
the method of calculation.

The table gives the number of litters of mice, in certain
breeding experiments, with given numbers (X) in the litter. (Data
from A. D. Darbishire, Biometrika, iii. pp. 30, 31.)



Number in Litter.    Number of Litters.    f/X.
       X.                    f.
       1                     7             7.000
       2                    11             5.500
       3                    16             5.333
       4                    17             4.250
       5                    26             5.200
       6                    31             5.167
       7                    11             1.571
       8                     1             0.125
       9                     1             0.111
     Total                 121            34.257

Whence 1/H = 0.2831, H = 3.532. The arithmetic mean is 4.587,
or more than a unit greater.
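
The calculation may be repeated mechanically as a check. The sketch below is illustrative: the litter sizes and frequencies are as read from the table above (they total 121 litters and reproduce the printed results), while the variable names are assumptions of the example.

```python
litter_sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9]         # X, number in litter
frequencies = [7, 11, 16, 17, 26, 31, 11, 1, 1]    # f, number of litters

n = sum(frequencies)

# Harmonic mean by equation (18): 1/H is the arithmetic mean of the reciprocals.
sum_f_over_x = sum(f / x for x, f in zip(litter_sizes, frequencies))
harmonic_mean = n / sum_f_over_x

# Arithmetic mean of the same distribution, for comparison.
arithmetic_mean = sum(f * x for x, f in zip(litter_sizes, frequencies)) / n

print(n, round(harmonic_mean, 3), round(arithmetic_mean, 3))
```

The sum of f/X computed without intermediate rounding differs from the printed 34.257 only in the last figure, and the harmonic mean is unaffected to three places of decimals.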

If the prices of a commodity at different places or times are
stated in the form "so much for a unit of money," and an average
price obtained by taking the arithmetic mean of the quantities
sold for a unit of money, the result is equivalent to the harmonic
mean of prices stated in the ordinary way. Thus retail prices of
eggs are usually quoted in England as "so many to the shilling."
Supposing we had 100 returns of retail prices of eggs, 50 returns
showing twelve eggs to the shilling, 30 fourteen to the shilling,
and 20 ten to the shilling; then the mean number per shilling
would be 12.2, equivalent to a price of 0.984d. per egg. But
if the prices had been quoted in the form usual for other
commodities, we should have had 50 returns showing a price of 1d.
per egg, 30 showing a price of 0.857d., and 20 a price of 1.2d.;
arithmetic mean 0.997d., a slightly greater value than the
harmonic mean of 0.984d. The official returns of prices in India were,
until 1907, given in the form of "Sers (2.057 lbs.) per rupee."
The average annual price of a commodity was based on
half-monthly prices stated in this form, and "index-numbers" were
calculated from such annual averages. In the issues of "Prices
and Wages in India" for 1908 and later years the prices have
been stated in terms of "rupees per maund (82.286 lbs.)." The
change, it will be seen, amounts to a replacement of the harmonic
by the arithmetic mean price.
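
The equivalence asserted in the example of the eggs may be verified directly. The following is a minimal sketch using the 100 returns of the text; the variable names, and the constant of twelve pence to the shilling written out explicitly, are the only additions.

```python
# (eggs to the shilling, number of returns), as in the example of the text.
returns = [(12, 50), (14, 30), (10, 20)]
PENCE_PER_SHILLING = 12

total = sum(count for _, count in returns)

# Arithmetic mean of the quantities per shilling, converted to a price per egg.
mean_eggs_per_shilling = sum(q * c for q, c in returns) / total
price_from_quantities = PENCE_PER_SHILLING / mean_eggs_per_shilling

# Arithmetic and harmonic means of the prices per egg (in pence) themselves.
prices = [(PENCE_PER_SHILLING / q, c) for q, c in returns]
arithmetic_mean_price = sum(p * c for p, c in prices) / total
harmonic_mean_price = total / sum(c / p for p, c in prices)

print(round(price_from_quantities, 3), round(arithmetic_mean_price, 3))
```

The price derived from the mean number of eggs per shilling coincides with the harmonic mean of the prices per egg, and falls a little short of their arithmetic mean.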

The harmonic mean of a series of quantities is always lower
than the geometric mean of the same quantities, and, à fortiori,
lower than the arithmetic mean, the amount of difference
depending largely on the magnitude of the dispersion relatively to the
magnitude of the mean. (Cf. Question 9, Chap. VIII.)

REFERENCES.

General.

(1) Fechner, G. T., "Ueber den Ausgangswerth der kleinsten
Abweichungssumme, dessen Bestimmung, Verwendung und
Verallgemeinerung," Abh. d. kgl. sächsischen Gesellschaft d. Wissenschaften, vol.
xviii. (also numbered xi. of the Abh. d. math.-phys. Classe); Leipzig
(1878), p. 1. (The average defined as the origin from which the
dispersion, measured in one way or another, is a minimum: geometric
mean dealt with incidentally, pp. 13-16.)

(2) Fechner, G. T., Kollektivmasslehre, herausgegeben von G. F. Lipps;
Engelmann, Leipzig, 1897. (Posthumously published: deals with
frequency-distributions, their forms, averages, and measures of
dispersion in general: includes much of the matter of (1).)

(3) Zizek, Franz, Die statistischen Mittelwerthe; Duncker und Humblot,
Leipzig, 1908. (Non-mathematical, but useful to the economic student
for references cited.)

The Geometric Mean.

(4) Jevons, W. Stanley, A Serious Fall in the Value of Gold ascertained,
and its Social Effects set forth; Stanford, London, 1863. Reprinted
in Investigations in Currency and Finance; Macmillan, London, 1884.
(The geometric mean applied to the measurement of price changes.)

(5) Jevons, W. Stanley, "On the Variation of Prices and the Value of
the Currency since 1782," Jour. Roy. Stat. Soc., vol. xxviii., 1865.
Also reprinted in volume cited above.

(6) Edgeworth, F. Y., "On the Method of ascertaining a Change in the
Value of Gold," Jour. Roy. Stat. Soc., vol. xlvi., 1883, p. 714. (Some
criticism of the reasons assigned by Jevons for the use of the geometric
mean.)

(7) Galton, Francis, "The Geometric Mean in Vital and Social Statistics,"
Proc. Roy. Soc., vol. xxix., 1879, p. 365.

(8) McAlister, Donald, "The Law of the Geometric Mean," ibid., p. 367.
(The law of frequency to which the use of the geometric mean would
be appropriate.)

(9) Kapteyn, J. C., Skew Frequency-curves in Biology and Statistics;
Noordhoff, Groningen, and Wm. Dawson, London, 1903. (Contains,
amongst other forms, a generalisation of McAlister's law.)

(10) Crawford, G. E., "An Elementary Proof that the Arithmetic Mean
of any number of Positive Quantities is greater than the Geometric
Mean," Proc. Edin. Math. Soc., vol. xviii., 1899-1900.
See also refs. 1 and 2.

The Mode.

(11) Pearson, Karl, "Skew Variation in Homogeneous Material," Phil.
Trans. Roy. Soc., Series A, vol. clxxxvi., 1895, p. 343. (Definition of
mode, p. 345.)

(12) Yule, G. U., "Notes on the History of Pauperism in England and
Wales, etc.: Supplementary Note on the Determination of the Mode,"
Jour. Roy. Stat. Soc., vol. lix., 1896, p. 343. (The note deals with
elementary methods of approximately determining the mode: the
one-third rule and one other.)

(13) Pearson, Karl, "On the Modal Value of an Organ or Character,"
Biometrika, vol. i., 1902, p. 260. (A warning as to the inadequacy of
mere inspection for determining the mode.)

Estimates of Population.

(14) Waters, A. C., "A Method for estimating Mean Populations in the
last Intercensal Period," Jour. Roy. Stat. Soc., vol. lxiv., 1901, p. 293.

(15) Waters, A. C., Estimates of Population: Supplement to the 65th Annual
Report of the Registrar-General for England and Wales (Cd. 2618, 1907),
p. cxvii.

Index-numbers.

These were incidentally referred to in § 25. The general theory of
index-numbers and the different methods in which they may be formed
are not considered in the present work. The student will find copious
references to the literature in the following:

(16) Edgeworth, F. Y., "Reports of the Committee appointed for the
purpose of investigating the best methods of ascertaining and measuring
Variations in the Value of the Monetary Standard," British Association
Reports, 1887 (p. 247), 1888 (p. 181), 1889 (p. 133), and 1890 (p. 485).

(17) Edgeworth, F. Y., Article "Index-numbers" in Palgrave's Dictionary
of Political Economy, vol. ii.; Macmillan, 1896.

(18) Fountain, H., "Memorandum on the Construction of Index-numbers
of Prices," in the Board of Trade Report on Wholesale and Retail
Prices in the United Kingdom, 1903.



EXERCISES.

1. Verify the following means and medians from the data of Table VI.:

Stature in Inches for Adult Males in
              England.    Scotland.    Wales.    Ireland.
Mean   . .    67.31       68.55        66.62     67.78
Median . .    67.35       68.48        66.56     67.69

In the calculation of the means use the same arbitrary origin as in Example
ii., and check your work by the method of § 13 (b).

2. Find the mean weight of adult males in the United Kingdom from the
data in the last column of Table IX., Chap. VI. Also find the median weight,
and hence the approximate mode, by the method of § 20.

3. Similarly, find the mean, median, and approximate value of the mode
for the distribution of fecundity in race-horses, Table X., Chap. VI.

4. Using a graphical method, find the median annual value of houses assessed
to inhabited house duty in the financial year 1885-6 from the data of Table
IV., Chap. VI.

5. (Data from Sauerbeck, Jour. Roy. Stat. Soc., March 1909.) The figures
in columns 1 and 2 of the small table below show the index-numbers (or
percentages) of prices of certain animal foods in the years 1898 and 1908, on
their average prices during the years 1867-77. In column 3 have been added
the ratios of the index-numbers in 1908 to the index-numbers in 1898, the
latter being taken as 100.

Find the average ratio of prices in 1908 to prices in 1898, taken as 100:

(1) From the arithmetic mean of the ratios in col. 3.
(2) From the ratio of the arithmetic means of cols. 1 and 2.
(3) From the ratio of the geometric means of cols. 1 and 2.
(4) From the geometric mean of the ratios in col. 3.

Note that, by § 25, the last two methods must give the same result.

                          Index-number of price in    Ratio
Commodity.                1898.        1908.          08/98.
                           1.           2.             3.
1. Beef, prime             78           88            112.8
2. Beef, middling          72           90            125.0
3. Mutton, prime           84           92            109.5
4. Mutton, middling        67           95            141.8
5. Pork                    87           83             95.4
6. Bacon                   78           84            107.7
7. Butter                  78           91            116.7












6. (Data from census of 1901.) The table below shows the population of
the rural sanitary districts of Essex, the urban sanitary districts (other than
the borough of West Ham), and the borough of West Ham, at the censuses
of 1891 and 1901. Estimate the total population of the county at a date
midway between the two censuses, (1) on the assumption that the percentage
rate of increase is constant for the county as a whole, (2) on the assumption
that the percentage rate of increase is constant in each group of districts and
the borough of West Ham.

Essex.                         Population.
                           1891.          1901.
Rural districts           232,867        240,776
West Ham                  204,903        267,358
Other urban districts     345,604        575,864
Total                     783,374      1,083,998

7. (Data from Agricultural Statistics for 1905, Cd. 3061, 1906.) The
following statement shows the monthly average prices of eggs in Great
Britain in 1905, as compiled from the weekly returns of market prices for
first and second quality British eggs, per 120:



Month. 


First 
Quality. 


Second 
Quality. 




8. d. 


s. d. 


January 


18 


n 






11 


9 






8 


6 


April 




7 6 


8 e 


mV - . 




8 


7 8 










July . . 






8 6 








10 






11 a 


10 6 


October . 




14 


12 6 






18 


16 


December 




17 6 


15 


Mean for jea 




11 6i 


10 01 



What would have been the mean price for the year in each case if the
wholesale prices had been recorded in the same way as retail prices, i.e. at so many
eggs per shilling? State your answer in the form of the equivalent price per
120, and obtain it in the shortest way by taking the harmonic mean of the
above prices (cf. § 27).

8. Supposing the frequencies of values 0, 1, 2, ... of a variable to be
given by the terms of the binomial series

$$q^n,\quad n\,q^{n-1}p,\quad \frac{n(n-1)}{1\cdot 2}\,q^{n-2}p^2,\quad \ldots$$

where p + q = 1, find the mean.



CHAPTER VIII.

MEASURES OF DISPERSION, ETC.

1. Inadequacy of the range as a measure of dispersion—2-13. The standard
deviation: its definition, calculation, and properties—14-19. The
mean deviation: its definition, calculation, and properties—20-24. The
quartile deviation or semi-interquartile range—25. Measures of
relative dispersion—26. Measures of asymmetry or skewness—27-30.
The method of grades or percentiles.

1. The simplest possible measure of the dispersion of a series of
values of a variable is the actual range, i.e. the difference between
the greatest and least values observed. While this is frequently
quoted, it is as a rule the worst of all possible measures for any
serious purpose. There are seldom real upper and lower limits
to the possible values of the variable, very large or very small
values being only more or less infrequent: the range is therefore
subject to meaningless fluctuations of considerable magnitude
according as values of greater or less infrequency happen to
have been actually observed. Note, for instance, the figures of
Table IX., Chap. VI. p. 95, showing the frequency distributions of
weights of adult males in the several parts of the United
Kingdom. In Wales, one individual was observed with a weight of
over 280 lbs., the next heaviest being under 260 lbs. The
addition of the one very exceptional individual has increased the
range by some 30 lbs., or about one-fifth. A measure subject to
erratic alterations by casual influences in this way is clearly not
of much use for comparative purposes. Moreover, the measure
takes no account of the form of the distribution within the limits
of the range; it might well happen that, of two distributions
covering precisely the same range of variation, the one showed
the observations for the most part closely clustered round the
average, while the other exhibited an almost even distribution of
frequency over the whole range. Clearly we should not regard
two such distributions as exhibiting the same dispersion, though
they exhibit the same range. Some sort of measure of dispersion
is therefore required, based, like the averages discussed in the last
chapter, on all the observations made, so that no single observation
can have an unduly preponderant effect on its magnitude; indeed,
the measure should possess all the properties laid down as
desirable for an average in § 4 of Chap. VII. There are three such
measures in common use: the standard deviation, the mean
deviation, and the quartile deviation or semi-interquartile range,
of which the first is the most important.

2. The Standard Deviation.—The standard deviation is the
square root of the arithmetic mean of the squares of all deviations,
deviations being measured from the arithmetic mean of the
observations. If the standard deviation be denoted by σ, and a
deviation from the arithmetic mean by x, as in the last chapter,
then the standard deviation is given by the equation

$$\sigma^2 = \frac{1}{N}\,\Sigma(x^2) \qquad (1)$$

To square all the deviations may seem at first sight an artificial
procedure, but it must be remembered that it would be useless to
take the mere sum of the deviations, in order to obtain a measure
of dispersion, since this sum is necessarily zero if deviations be
taken from the mean. In order to obtain some quantity that
shall vary with the dispersion it is necessary to average the
deviations by a process that treats them as if they were all of the
same sign, and squaring is the simplest process for eliminating
signs which leads to results of algebraical convenience.

3. A quantity analogous to the standard deviation may be
defined in more general terms. Let $A$ be any arbitrary value of
$X$, and let $\xi$ (as in Chap. VII. § 8) denote the deviation of $X$
from $A$; i.e. let

$$\xi = X - A.$$

Then we may define the root-mean-square deviation $s$ from the
origin $A$ by the equation

$$s^2 = \frac{1}{N}\Sigma(\xi^2) \qquad (2)$$

In terms of this definition the standard deviation is the
root-mean-square deviation from the mean. There is a very simple
relation between the standard deviation and the root-mean-square
deviation from any other origin. Let

$$M - A = d \qquad (3)$$

so that $\xi = x + d$. Then

$$\xi^2 = x^2 + 2xd + d^2,$$

$$\Sigma(\xi^2) = \Sigma(x^2) + 2d\,\Sigma(x) + Nd^2.$$

But the sum of the deviations from the mean is zero, therefore
the second term vanishes, and accordingly

$$s^2 = \sigma^2 + d^2 \qquad (4)$$

Hence the root-mean-square deviation is least when deviations
are measured from the mean, i.e. the standard deviation is the least
possible root-mean-square deviation.

$\Sigma(\xi^2)$, or $\Sigma(f\xi^2)$ if we are dealing with a grouped distribution
and $f$ is the frequency of $\xi$, is sometimes termed the second moment
of the distribution about $A$, just as $\Sigma(\xi)$ or $\Sigma(f\xi)$ is termed
the first moment (cf. Chap. VII. § 8): we shall not make use
of the term in the present work. Generally, $\Sigma(f\xi^n)$ is termed
the nth moment.
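
Relation (4) is easily verified numerically. The following is a minimal
sketch in Python, not part of the original text: the five observations and
the function name `mean_square_about` are hypothetical, chosen only to
show that the mean-square deviation about any origin exceeds that about
the mean by exactly the square of d.

```python
from math import isclose


def mean_square_about(values, origin):
    """Mean of the squared deviations of `values` from `origin`."""
    return sum((v - origin) ** 2 for v in values) / len(values)


values = [3.0, 7.0, 8.0, 12.0, 15.0]      # hypothetical observations
M = sum(values) / len(values)             # arithmetic mean, here 9.0
variance = mean_square_about(values, M)   # sigma^2 of equation (1)

for A in (0.0, 5.0, 9.0, 20.0):           # arbitrary origins
    d = M - A                             # equation (3)
    s2 = mean_square_about(values, A)     # s^2 of equation (2)
    assert isclose(s2, variance + d * d)  # equation (4)
    assert s2 >= variance                 # least when A coincides with M
```

The second assertion in the loop restates the conclusion of the section:
no origin gives a smaller root-mean-square deviation than the mean.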

4. If $\sigma$ and $d$ are the two sides of a right-angled triangle, $s$ is
the hypotenuse. If, then, MH be the vertical through the
mean of a frequency-distribution (fig. 25), and MS be set off
equal to the standard deviation (on the same scale in which the
variable X is plotted along the base), SA will be the root-mean-square
deviation from the point A. This construction gives a
concrete idea of the way in which the root-mean-square deviation
depends on the origin from which deviations are measured. It
will be seen that for small values of $d$ the difference of $s$ from $\sigma$
will be very minute, since A will lie very nearly on the circle
drawn through M with centre S and radius SM: slight errors
in the mean due to approximations in calculation will not,
therefore, appreciably affect the value of the standard deviation.

5. If we have to deal with relatively few, say thirty or forty,
ungrouped observations, the method of calculating the standard
deviation is perfectly straightforward. It is illustrated by the
figures given below for the estimated average earnings of
agricultural labourers in 38 rural unions. The values (earnings)
are first of all totalled and the total divided by $N$ to give the
arithmetic mean $M$, viz. 15s. 11 5/19 d., or 15s. 11d. to the nearest
penny. The earnings being estimates, it is not necessary to take
the average to any higher degree of accuracy. Having found
the mean, the difference of each observation from the mean is
next written down as in col. 3, one penny being taken as the
unit: the signs are not entered, as they are not wanted, but the
work should be checked by totalling the positive and negative
differences separately. [The positive total is 300 and the
negative 290, thus checking the value for the mean, viz. 15s.
11d. + 10/38d.]

Finally, each difference is squared, and the squares entered in
col. 4; tables of squares are useful for such work if any of the
differences to be squared are large (see list of Tables, p. 363).
The sum of the squares is 16,018. Treating the value taken for
the mean as sensibly accurate, we have

$$\sigma^2 = \frac{16{,}018}{38} = 421.5263, \qquad \sigma = 20.5\ \text{d.}$$

If we wish to be more precise we can reduce to the true mean
by the use of equation (4), as follows:

$$d = \frac{10}{38} = 0.2632; \qquad d^2 = 0.0693.$$

Hence

$$\sigma^2 = s^2 - d^2 = 421.4570, \qquad \sigma = 20.529\ \text{d.}$$

Evidently this reduction, in the given case, is unnecessary,
illustrating the fact mentioned at the end of § 4, that small
errors in the mean have little effect on the value found for the
standard deviation. The first value is correct within a very
small fraction of a penny.
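
The arithmetic of this section may be retraced as follows. This is a
sketch only, in Python, with the thirty-eight earnings of Example i.
transcribed as (shillings, pence) pairs; the variable names are
illustrative and not the author's. It works from the rounded mean
(15s. 11d. = 191 pence) and then reduces to the true mean by
equation (4).

```python
from math import sqrt

# Earnings (shillings, pence) of the 38 unions of Example i., highest first.
earnings = [(20, 9), (20, 3), (19, 8), (18, 6), (17, 8), (17, 6), (17, 1),
            (17, 0), (17, 0), (16, 11), (16, 6), (16, 4), (16, 3), (16, 3),
            (16, 0), (16, 0), (15, 9), (15, 8), (15, 6), (15, 6), (15, 4),
            (15, 3), (15, 0), (15, 0), (15, 0), (15, 0), (15, 0), (15, 0),
            (14, 10), (14, 9), (14, 9), (14, 9), (14, 7), (14, 6), (14, 6),
            (14, 4), (13, 6), (12, 6)]

pence = [12 * s + d for s, d in earnings]   # one penny as the unit
N = len(pence)
A = 191                                     # working mean, 15s. 11d.

sum_of_squares = sum((x - A) ** 2 for x in pence)   # col. 4 total
s2 = sum_of_squares / N                     # mean square about A
d = sum(pence) / N - A                      # M - A, equation (3)
sigma_rough = sqrt(s2)                      # treating A as the mean
sigma = sqrt(s2 - d * d)                    # reduced by equation (4)

print(round(sigma_rough, 1), round(sigma, 3))
```

The two values differ only in the second decimal place, which is the
point made at the end of the section.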



CALCULATION OF THE STANDARD DEVIATION: Example i. Calculation of
Mean and Standard Deviation for a Short Series of Observations
ungrouped. Estimated Average Weekly Earnings of Agricultural Labourers
in Thirty-eight Rural Unions, in 1892-3. (W. Little; Labour
Commission; Report, vol. v., part i., 1894.)

            1.                    2.             3.            4.
          Union.               Earnings      Difference   (Difference)^2.
                              (Shillings      (Pence).
                              and Pence).
                                s.  d.
   1. Glendale                  20   9           58           3,364
   2. Wigton                    20   3           52           2,704
   3. Garstang                  19   8           45           2,025
   4. Belper                    18   6           31             961
   5. Nantwich                  17   8           21             441
   6. Atcham                    17   6           19             361
   7. Driffield                 17   1           14             196
   8. Uttoxeter                 17   0           13             169
   9. Wetherby                  17   0           13             169
  10. Easingwold                16  11           12             144
  11. Southwell                 16   6            7              49
  12. Hollingbourn              16   4            5              25
  13. Melton Mowbray            16   3            4              16
  14. Truro                     16   3            4              16
  15. Godstone                  16   0            1               1
  16. Louth                     16   0            1               1
  17. Brixworth                 15   9            2               4
  18. Crediton                  15   8            3               9
  19. Holbeach                  15   6            5              25
  20. Maldon                    15   6            5              25
  21. [illegible]               15   4            7              49
  22. St Neots                  15   3            8              64
  23. [illegible]               15   0           11             121
  24. Thakeham                  15   0           11             121
  25. Thame                     15   0           11             121
  26. Thingoe                   15   0           11             121
  27. [illegible]               15   0           11             121
  28. [illegible]               15   0           11             121
  29. N. Witchford              14  10           13             169
  30. Pewsey                    14   9           14             196
  31. Bromyard                  14   9           14             196
  32. Wantage                   14   9           14             196
  33. Stratford-on-Avon         14   7           16             256
  34. [illegible]               14   6           17             289
  35. Woburn                    14   6           17             289
  36. [illegible]               14   4           19             361
  37. Pershore                  13   6           29             841
  38. Langport                  12   6           41           1,681

      Total                    605   8         + 300         16,018
                                               - 290

Differences for unions 1 to 16 are positive (earnings above 15s. 11d.);
those for unions 17 to 38 are negative.

The figures dealt with in this illustration are estimates of the
weekly earnings of the agricultural labourers, i.e. they include
allowances for gifts in kind, such as coal, potatoes, cider, etc. The
estimated weekly money wages are, however, also given in the
same Report, and we are thus enabled to make an interesting
comparison of the dispersions of the two. It might be expected
that earnings would vary less than wages, as his earnings and not
the mere money wages he receives are the important matter to
the labourer, and as a fact we find

Standard deviation of weekly earnings . . . 20.5d.
Standard deviation of weekly wages  . . . . 26.0d.

The arithmetic mean wage is 13s. 5d.

6. If we have to deal with a grouped frequency-distribution,
the same artifices and approximations are used as in the calculation
of the mean (Chap. VII. §§ 8, 9, 10). The mid-value of one of
the class-intervals is chosen as the arbitrary origin $A$ from which
to measure the deviations $\xi$, the class-interval is treated as a
unit throughout the arithmetic, and all the observations within
any one class-interval are treated as if they were identical with
the mid-value of the interval. If, as before, we denote the
frequency in any one interval by $f$, these $f$ observations
contribute $f\xi^2$ to the sum of the squares of deviations and we
have

$$s^2 = \frac{1}{N}\Sigma(f\xi^2).$$

The standard deviation is then calculated from equation (4).

7. The whole of the work proceeds naturally as an extension of
that necessary for calculating the mean, and we accordingly use
the same illustrations as in the last chapter. Thus in Example
ii. below, cols. 1, 2, 3, and 4 are the same as those we have already
given in Example i. of Chap. VII. for the calculation of the mean.
Column 5 gives the figures necessary for calculating the standard
deviation, and is derived directly from col. 4 by multiplying the
figures of that column again by $\xi$. Thus 90 × 5 = 450, 192 × 4 =
768, and so on. The work is therefore done very rapidly. The
remaining steps of the arithmetic are given below the table; the
student must be careful to remember the final conversion, if
necessary, from the class-interval as unit to the natural unit
of measurement. In this case the value found is 2.48
class-intervals, and the class-interval being half a unit, that is 1.24
per cent.
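
The grouped calculation of §§ 6 and 7 can be sketched in a few lines.
The Python below is an illustration only, not the author's procedure
verbatim: the frequencies are those of Example ii., the origin A is
taken at 3.5 per cent., and the variable names are hypothetical. The
class-interval serves as the unit until the last step.

```python
from math import sqrt

mid_values = [1.0 + 0.5 * i for i in range(16)]   # 1.0, 1.5, ..., 8.5 per cent.
freq = [18, 48, 72, 89, 100, 90, 75, 60, 40, 21, 11, 5, 1, 1, 0, 1]
width = 0.5                                       # class-interval, per cent.
A = 3.5                                           # arbitrary origin

N = sum(freq)
xi = [round((m - A) / width) for m in mid_values]       # col. 3: -5 ... +10
first_moment = sum(f * x for f, x in zip(freq, xi))     # col. 4 total (signed)
second_moment = sum(f * x * x for f, x in zip(freq, xi))  # col. 5 total

d = first_moment / N                  # M - A, in class-intervals
s2 = second_moment / N                # mean square about A
sigma_intervals = sqrt(s2 - d * d)    # equation (4)

mean = A + d * width                  # back to natural units
sigma = sigma_intervals * width

print(round(mean, 2), round(sigma_intervals, 2), round(sigma, 2))
```

Keeping the deviations as whole numbers of class-intervals is what makes
col. 5 obtainable from col. 4 by a single further multiplication.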

CALCULATION OF THE STANDARD DEVIATION: Example ii. Calculation of
the Standard Deviation of the Percentages of the Population in receipt of
Relief, in addition to the Mean, from the figures of Table VIII. of
Chap. VI. (Cf. the work for the mean alone, p. 111.)

     (1)           (2)            (3)             (4)          (5)
  Percentage    Frequency      Deviation        Product      Product
  in receipt       $f$.        from Value        $f\xi$.     $f\xi^2$.
  of Relief.                 $A$ (3.5), $\xi$.

     1.0            18           - 5               90          450
     1.5            48           - 4              192          768
     2.0            72           - 3              216          648
     2.5            89           - 2              178          356
     3.0           100           - 1              100          100
     3.5            90             0               ..           ..
     4.0            75           + 1               75           75
     4.5            60           + 2              120          240
     5.0            40           + 3              120          360
     5.5            21           + 4               84          336
     6.0            11           + 5               55          275
     6.5             5           + 6               30          180
     7.0             1           + 7                7           49
     7.5             1           + 8                8           64
     8.0             0           + 9               ..           ..
     8.5             1           + 10              10          100

    Total          632            ..            - 776        4,001
                                                + 509

From previous work, p. 111, $M - A = d = -0.4225$ class-intervals.

$$s^2 = \frac{4001}{632} = 6.3307, \qquad d^2 = 0.1785,$$

$$\sigma^2 = s^2 - d^2 = 6.1522,$$

$$\therefore\ \sigma = 2.48 \text{ intervals} = 1.24 \text{ per cent.}$$

To illustrate again the value of the standard deviation for
purposes of comparison, figures are given below showing the
means and standard deviations of similar distributions for a series
of years from 1850. It will be seen that not only did the mean
decrease during the period, but the standard deviation decreased
to an equally marked extent, having been halved between
1850 and 1891; the average was lowered, and at the same time
the percentages of the population in receipt of relief clustered
much more closely round the lower average.



Means and Standard Deviations of the Distributions of Pauperism (Percentage
of the Population in receipt of Poor-law Relief) in the Unions of England
and Wales since 1850. (From Yule, Jour. Roy. Stat. Soc., vol. lix.,
1896, figures slightly amended.)

               Percentage of the Population
                   in receipt of Relief.
    Year.        Mean.        Standard Deviation.

    1850         6.51               2.50
    1860         5.20               2.07
    1870         5.[illegible]      2.02
    1881         3.[?]8             [illegible]
    1891         3.29               1.24



8. In the table given on p. 141 (Example iii.), the calculation of
the standard deviation is similarly shown for the distribution of
the statures of adult males in the British Isles, the work being
continued from the stage which it reached for the calculation of
the mean in Example ii. of Chap. VII. The steps of the arithmetic
hardly call for further explanation, but it may be noted that
the class-interval being a unit in this case, no conversion of
the standard deviation from class-intervals to units is required.

9. The student must remember, as in the case of the calculation
of the mean, that the treatment of all values within each
class-interval as if they were identical with the mid-value of the interval
is an approximation and no more (cf. Chap. VII. § 11), though,
for a distribution of the symmetrical or moderately asymmetrical
type with a class-interval not greater than one-twentieth or so
of the range, the approximation may be a very close one. But
while the value of the arithmetic mean may be either increased
or decreased by grouping, in the case of distributions which are
not more than slightly asymmetrical, the standard deviation of
such distributions always tends to be increased, and the increase
is the greater the cruder the grouping. We give an approximate
correction for this effect later (Chap. XI. § 3). The student is
recommended to test for himself the effect of grouping in two
or three cases.
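
One way of making the test recommended above is sketched here in
Python. Everything in it is an assumption made for illustration: the
2000 "observations" are evenly spaced quantiles of a normal curve with
unit standard deviation (so that no random numbers are needed), the
class limits are placed at multiples of the class-interval, and the
names `grouped_sd`, `fine` and `crude` are hypothetical.

```python
from math import floor
from statistics import NormalDist, pstdev


def grouped_sd(values, width):
    """Standard deviation after replacing every value by the mid-value
    of its class-interval (class limits at multiples of `width`)."""
    mids = [(floor(v / width) + 0.5) * width for v in values]
    return pstdev(mids)


# A symmetrical artificial series: 2000 evenly spaced normal quantiles.
n = 2000
values = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

raw = pstdev(values)              # ungrouped standard deviation
fine = grouped_sd(values, 0.5)    # class-interval of half the s.d.
crude = grouped_sd(values, 2.0)   # class-interval of twice the s.d.

print(round(raw, 3), round(fine, 3), round(crude, 3))
```

For this series the grouped values exceed the ungrouped one, and the
cruder grouping gives the larger excess, in agreement with the statement
of the section; other series and other placings of the class limits may
of course behave a little differently.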

CALCULATION OF THE STANDARD DEVIATION: Example iii. Calculation
of the Standard Deviation of Stature of Male Adults in the British Isles
from the figures of Table VI., p. 88. (Cf. p. 112 for the calculation of
the mean.)

     (1)           (2)            (3)             (4)          (5)
   Stature      Frequency      Deviation        Product      Product
  (inches).        $f$.        from Value        $f\xi$.     $f\xi^2$.
                                $A$, $\xi$.

     57-             2           - 10              20           200
     58-             4           -  9              36           324
     59-            14           -  8             112           896
     60-            41           -  7             287         2,009
     61-            83           -  6             498         2,988
     62-           169           -  5             845         4,225
     63-           394           -  4           1,576         6,304
     64-           669           -  3           2,007         6,021
     65-           990           -  2           1,980         3,960
     66-         1,223           -  1           1,223         1,223
     67-         1,329              0              ..            ..
     68-         1,230           +  1           1,230         1,230
     69-         1,063           +  2           2,126         4,252
     70-           646           +  3           1,938         5,814
     71-           392           +  4           1,568         6,272
     72-           202           +  5           1,010         5,050
     73-            79           +  6             474         2,844
     74-            32           +  7             224         1,568
     75-            16           +  8             128         1,024
     76-             5           +  9              45           405
     77-             2           + 10              20           200

    Total        8,585            ..          - 8,584        56,809
                                              + 8,763

From previous work, $M - A = d = +0.0209$ class-intervals or inches.

$$s^2 = \frac{56{,}809}{8585} = 6.6172,$$

$$\sigma^2 = 6.6172 - (0.0209)^2 = 6.6168,$$

$$\therefore\ \sigma = 2.57 \text{ class-intervals or inches.}$$

10. It is a useful empirical rule to remember that a range of
six times the standard deviation usually includes 99 per cent. or
more of all the observations in the case of distributions of the
symmetrical or moderately asymmetrical type. Thus in Example
ii. the standard deviation is 1.24 per cent.; six times this is 7.44
per cent., and a range from 0.75 to 8.19 per cent. includes all
but one observation out of 632. In Example iii. the standard
deviation is 2.57 in., six times this is 15.42 in., and a range from,
say, 60 in. to 75.4 in. includes all but some 37 out of 8585
individuals, i.e. about 99.6 per cent. This rough rule serves to
give a more definite and concrete meaning to the standard
deviation, and also to check arithmetical work to some extent:
sufficiently, that is to say, to guard against very gross blunders.
It must not be expected to hold for short series of observations:
in Example i., for instance, the actual range is a good deal less
than six times the standard deviation.
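
The check made above on Example ii. may be written out as follows. This
is a sketch in Python under the reading that the range of six times the
standard deviation is laid off from the lower limit of the lowest
class-interval (0.75 per cent.), as in the text; the variable names are
illustrative only.

```python
lower_limits = [0.75 + 0.5 * i for i in range(16)]   # 0.75, 1.25, ..., 8.25
freq = [18, 48, 72, 89, 100, 90, 75, 60, 40, 21, 11, 5, 1, 1, 0, 1]
width = 0.5          # class-interval, per cent.
sigma = 1.24         # standard deviation found in Example ii.

range_low = lower_limits[0]
range_high = range_low + 6 * sigma     # 0.75 + 7.44

# Count the observations in classes not wholly inside the range.
outside = sum(f for lo, f in zip(lower_limits, freq)
              if lo < range_low or lo + width > range_high)
share_inside = 1 - outside / sum(freq)

print(round(range_high, 2), outside, round(100 * share_inside, 1))
```

A class lying partly outside the range is counted here as wholly outside,
which errs on the cautious side.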

11. The standard deviation is the measure of dispersion which
it is most easy to treat by algebraical methods, resembling in this
respect the arithmetic mean amongst measures of position. The
majority of illustrations of its treatment must be postponed to a
later stage (Chap. XI.), but the work of § 3 has already served as
one example, and we may take another by continuing the work of
§ 13 (b), Chap. VII. In that section it was shown that if a series
of observations of which the mean is $M$ consist of two component
series, of which the means are $M_1$ and $M_2$ respectively,

$$NM = N_1 M_1 + N_2 M_2,$$

$N_1$ and $N_2$ being the numbers of observations in the two
component series, and $N = N_1 + N_2$ the number in the entire series.
Similarly, the standard deviation $\sigma$ of the whole series may be
expressed in terms of the standard deviations $\sigma_1$ and $\sigma_2$ of the
components and their respective means. Let

$$M_1 - M = d_1, \qquad M_2 - M = d_2.$$

Then the mean-square deviations of the component series about
the mean $M$ are, by equation (4), $\sigma_1^2 + d_1^2$ and $\sigma_2^2 + d_2^2$
respectively. Therefore, for the whole series,

$$N\sigma^2 = N_1(\sigma_1^2 + d_1^2) + N_2(\sigma_2^2 + d_2^2) \qquad (5)$$

If the numbers of observations in the component series be equal
and the means be coincident, we have as a special case

$$\sigma^2 = \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2) \qquad (6)$$

so that in this case the square of the standard deviation of the
whole series is the arithmetic mean of the squares of the standard
deviations of its components.

It is evident that the form of the relation (5) is quite general:
if a series of observations consists of $r$ component series with
standard deviations $\sigma_1, \sigma_2, \ldots \sigma_r$, and means diverging from the
general mean of the whole series by $d_1, d_2, \ldots d_r$, the standard
deviation $\sigma$ of the whole series is given (using $m$ to denote any
subscript) by the equation

$$N\sigma^2 = \Sigma(N_m\sigma_m^2) + \Sigma(N_m d_m^2) \qquad (7)$$

Again, as in § 13 of Chap. VII., it is convenient to note, for the
checking of arithmetic, that if the same arbitrary origin be used
for the calculation of the standard deviations in a number of
component distributions we must have

$$\Sigma(f\xi^2) = \Sigma(f_1\xi_1^2) + \Sigma(f_2\xi_2^2) + \ldots + \Sigma(f_r\xi_r^2) \qquad (8)$$
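
Equation (7) lends itself to a direct numerical check. The sketch below,
in Python, uses two small hypothetical component series (not taken from
the text) and compares the two sides of the equation; `pvariance` from
the standard library gives the mean-square deviation about the mean,
i.e. the square of the standard deviation as defined in § 2.

```python
from math import isclose
from statistics import fmean, pvariance

# Two hypothetical component series.
series = [[2.0, 4.0, 6.0], [10.0, 12.0, 14.0, 16.0, 18.0]]

whole = [v for part in series for v in part]
N = len(whole)
M = fmean(whole)                         # general mean of the whole series

lhs = N * pvariance(whole)               # N sigma^2
rhs = sum(len(part) * (pvariance(part) + (fmean(part) - M) ** 2)
          for part in series)            # sum of N_m (sigma_m^2 + d_m^2)

assert isclose(lhs, rhs)
print(lhs, rhs)
```

The same loop serves for any number of component series, which is the
generality claimed for the relation.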

12. As another useful illustration, let us find the standard
deviation of the first $N$ natural numbers. The mean in this case
is evidently $(N+1)/2$. Further, as is shown in any elementary
Algebra, the sum of the squares of the first $N$ natural numbers is

$$\frac{N(N+1)(2N+1)}{6}.$$

The standard deviation $\sigma$ is therefore given by the equation

$$\sigma^2 = \tfrac{1}{6}(N+1)(2N+1) - \tfrac{1}{4}(N+1)^2,$$

that is,

$$\sigma^2 = \tfrac{1}{12}(N^2 - 1) \qquad (9)$$

This result is of service if the relative merit of, or the relative
intensity of some character in, the different individuals of a series
is recorded not by means of measurements, e.g. marks awarded on
some system of examination, but merely by means of their
respective positions when ranked in order as regards the character,
in the same way as boys are numbered in a class. With $N$
individuals there are always $N$ ranks, as they are termed,
whatever the character, and the standard deviation is therefore
always that given by equation (9).

Another useful result follows at once from equation (9), namely,
the standard deviation of a frequency-distribution in which all
values of $X$ within a range $\pm l/2$ on either side of the mean are
equally frequent, values outside these limits not occurring, so that
the frequency-distribution may be represented by a rectangle. The
base $l$ may be supposed divided into a very large number $N$ of equal
elements, each of width $l/N$, and the standard deviation reduces to
that of the first $N$ natural numbers, multiplied by $l/N$, when $N$ is
made indefinitely large. The single unit then becomes negligible
compared with $N^2$, and consequently

$$\sigma^2 = \frac{l^2}{12} \qquad (10)$$
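
Both results of this section can be confirmed with exact fractions. The
Python below is a sketch only: the function names and the base length
l = 6 are hypothetical, and the rectangle is approximated, as in the
text, by N equal elements of width l/N.

```python
from fractions import Fraction
from statistics import pvariance


def rank_variance(n):
    """sigma^2 of the first n natural numbers, equation (9)."""
    return Fraction(n * n - 1, 12)


# Equation (9) against a direct calculation, for several values of N.
for n in (2, 5, 38, 100):
    ranks = [Fraction(k) for k in range(1, n + 1)]
    assert pvariance(ranks) == rank_variance(n)


def rectangle_variance(length, n):
    """sigma^2 when a base of the given length is cut into n equal elements."""
    return (Fraction(length) / n) ** 2 * rank_variance(n)


limit = Fraction(6) ** 2 / 12                    # l^2 / 12, equation (10)
gaps = [limit - rectangle_variance(6, n) for n in (10, 100, 1000)]
print(gaps)   # the shortfall from l^2/12 shrinks as 1/N^2
```

With 38 ranks, as for the unions of Example i., equation (9) gives
1443/12, whatever the character ranked.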



13. It will be seen from the preceding paragraphs that the
standard deviation possesses the majority at least of the properties
which are desirable in a measure of dispersion as in an average
(Chap. VII. § 4). It is rigidly defined; it is based on all the
observations made; it is calculated with reasonable ease; it lends
itself readily to algebraical treatment; and we may add, though the
student will have to take the statement on trust for the present,
that it is, as a rule, the measure least affected by fluctuations of
sampling. On the other hand, it may be said that its general
nature is not very readily comprehended, and that the process of
squaring deviations and then taking the square root of the mean
seems a little involved. The student will, however, soon surmount
this feeling after a little practice in the calculation and use of the
constant, and will realise, as he advances further, the advantages
that it possesses. Such root-mean-square quantities, it may be
added, frequently occur in other branches of science. The
standard deviation should always be used as the measure of
dispersion, unless there is some very definite reason for preferring another
measure, just as the arithmetic mean should be used as the measure
of position. It may be added here that the student will meet with
the standard deviation under many different names, of which we
have adopted the most recent (due to Pearson, ref. 2): many of
the earlier names are hardly adapted to general use, as they bear
evidence of their derivation from the theory of errors of observation.
Thus the terms "mean error" (Gauss), "error of mean square"
(Airy), and "mean square error" have all been used in the same
sense. The square of the standard deviation, and also twice the
square, have been termed the "fluctuation" (Edgeworth): the
standard deviation multiplied by the square root of 2, the
"modulus" (Airy); the student will see later the reason for
the adoption of the factor. The reciprocal of the modulus has
been termed the "precision" (Lexis).

14. The Mean Deviation. The mean deviation of a series of
values of a variable is the arithmetic mean of their deviations
from some average, taken without regard to their sign. The
deviations may be measured either from the arithmetic mean or
from the median, but the latter is the natural origin to use. Just
as the root-mean-square deviation is least when deviations are
measured from the arithmetic mean, so the mean deviation is
least when deviations are measured from the median. For
suppose that, for some origin exceeded by $m$ values out of $N$, the
mean deviation has a value $\Delta$. Let the origin be displaced by
an amount $c$ until it is just exceeded by $m - 1$ of the values only,
i.e. until it coincides with the $m$th value from the upper end of
the series. By this displacement of the origin the sum of
deviations in excess of the mean is reduced by $mc$, while the sum of
deviations in defect of the mean is increased by $(N - m)c$. The
new mean deviation is therefore

$$\Delta + \frac{(N - m)c - mc}{N} = \Delta + \frac{1}{N}(N - 2m)c.$$

The new mean deviation is accordingly less than the old so long as

$$m > \tfrac{1}{2}N.$$

That is to say, if $N$ be even, the mean deviation is constant for
all origins within the range between the $N/2$th and the $(N/2 + 1)$th
observations, and this value is the least: if $N$ be odd, the mean
deviation is lowest when the origin coincides with the $(N + 1)/2$th
observation. The mean deviation is therefore a minimum when
deviations are measured from the median or, if the latter be
indeterminate, from an origin within the range in which it lies.

15. The calculation of the mean deviation either from the mean
or from the median for a series of ungrouped observations is very
simple. Take the figures of Example i. (p. 137) as an illustration.
We have already found the mean (15s. 11d. to the nearest penny),
and the deviations from the mean are written down in column 3.
Adding up this column without respect to the sign of the
deviations we find a total of 590. The mean deviation from the mean
is therefore 590/38 = 15.53d. The mean deviation from the
median is calculated in precisely the same way, but the median
replaces the mean as the origin from which deviations are measured.
The median is 15s. 6d. The deviations in pence run 63, 57, 50,
36, and so on; their sum is 570; and, accordingly, the mean
deviation from the median is 15d. exactly.
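
The two mean deviations just found, and the minimal property proved in
§ 14, may be checked together. The following Python sketch uses the
earnings of Example i. reduced to pence; the function name
`mean_deviation` and the grid of trial origins are illustrative choices,
not part of the text.

```python
from statistics import median

# Earnings of Example i. in pence, highest first.
pence = [249, 243, 236, 222, 212, 210, 205, 204, 204, 203, 198, 196, 195,
         195, 192, 192, 189, 188, 186, 186, 184, 183, 180, 180, 180, 180,
         180, 180, 178, 177, 177, 177, 175, 174, 174, 172, 162, 150]


def mean_deviation(values, origin):
    """Arithmetic mean of the deviations from `origin`, signs disregarded."""
    return sum(abs(v - origin) for v in values) / len(values)


md_mean = mean_deviation(pence, 191)               # from 15s. 11d.
md_median = mean_deviation(pence, median(pence))   # from 15s. 6d.

# No whole-penny origin within the range of the data does better than the median.
assert all(mean_deviation(pence, a) >= md_median for a in range(150, 250))

print(round(md_mean, 2), md_median)
```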

16. In the case of a grouped frequency-distribution, the sum
of deviations should be calculated first from the centre of the
class-interval in which the mean (or median) lies, and then
reduced to the mean as origin. Thus in the case of Example ii.
the mean is 3.29 per cent. and lies in the class-interval centring
round 3.5 per cent. We have already found that the sum of
deviations in defect of 3.5 per cent. is 776, and of deviations in
excess 509: total (without regard to sign) 1285, the unit of
measurement being, of course, as it is necessary to remember, the
class-interval. If the number of observations below the mean is
$N_1$ and above the mean $N_2$, and $M - A = d$, as before, we have to
add $N_1 d$ to the sum found and subtract $N_2 d$. In the present
case $N_1 = 327$ and $N_2 = 305$, while $d = -0.42$ class-intervals,
therefore

$$d(N_1 - N_2) = -0.42 \times 22 = -9.2,$$

and the sum of deviations from the mean is 1285 - 9.2 = 1275.8.
Hence the mean deviation from the mean is 1275.8/632 = 2.019
class-intervals, or 1.01 per cent.

17. The mean deviation from the median should be found in
precisely similar fashion, but the mid-value of the interval in
which the median (instead of the mean) lies should, for
convenience, be taken as origin. Thus in Example ii. the median is
(Chap. VII. § 15) 3.195 per cent. Hence 3.0 per cent. should be
taken as the origin, $d = +0.39$ intervals, $N_1 = 327$, $N_2 = 305$. The
deviation-sum with 3.0 as origin is found to be 1263, and the
correction is $+0.39 \times 22 = +8.6$. Hence the mean deviation
from the median is 2.012 intervals, or again 1.01 per cent. The
value is really smaller than that of the mean deviation from the
arithmetic mean, but the difference is too slight to affect the
second place of decimals.
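
The reduction described in §§ 16 and 17 can be sketched as a single
function. The Python below is an illustration under stated assumptions:
the frequencies are those of Example ii., the function name and its
arguments are hypothetical, and the correction d(N1 - N2) is valid only
while the average lies within one class-interval of the chosen
mid-value.

```python
freq = [18, 48, 72, 89, 100, 90, 75, 60, 40, 21, 11, 5, 1, 1, 0, 1]
width = 0.5   # class-interval, per cent.


def mean_deviation_intervals(freq, origin_index, d):
    """Mean deviation, in class-intervals, about a point lying d intervals
    from the mid-value of class `origin_index` (assumes |d| < 1)."""
    xi = [i - origin_index for i in range(len(freq))]
    total = sum(f * abs(x) for f, x in zip(freq, xi))     # sum from mid-value
    n_below = sum(f for f, x in zip(freq, xi) if x < d)   # N1
    n_above = sum(f for f, x in zip(freq, xi) if x > d)   # N2
    return (total + d * (n_below - n_above)) / sum(freq)


about_mean = mean_deviation_intervals(freq, 5, -0.42)    # origin 3.5 per cent.
about_median = mean_deviation_intervals(freq, 4, 0.39)   # origin 3.0 per cent.

print(round(about_mean, 3), round(about_median, 3))
print(round(about_mean * width, 2), round(about_median * width, 2))
```

The value about the median is the smaller of the two, though only in the
third decimal place of the class-interval.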

It should be noted that, as in the case of the standard deviation,
this method of calculation implies the assumption that all the
values of X within any one class-interval may be treated as if
they were the mid-value of that interval. This is, of course, an
approximation, but as a rule gives results of amply sufficient
accuracy for practice if the class-interval be kept reasonably small
(cf. again Chap. VI. § 5). We have left it as an exercise to the
student to find the correction to be applied if the values in each
interval are treated as if they were evenly distributed over the
interval, instead of concentrated at its centre (Question 7).

18. The mean deviation, it will be seen, can be calculated rather
more rapidly than the standard deviation, though in the case of a
grouped distribution the difference in ease of calculation is not
great. It is not, on the other hand, a convenient magnitude for
algebraical treatment; for example, the mean deviation of a
distribution obtained by combining several others cannot in general
be expressed in terms of the mean deviations of the component
distributions, but depends upon their forms. As a rule, it is more
affected by fluctuations of sampling than is the standard deviation,
but may be less affected if large and erratic deviations lying
somewhat beyond the bulk of the distribution are liable to occur.
This may happen, for example, in some forms of experimental
work, and in such cases the use of the mean deviation may be
slightly preferable to that of the standard deviation.

19. It is a useful empirical rule for the student to remember
that for symmetrical or only moderately asymmetrical
distributions, approaching the ideal forms of figs. 5 and 9, the mean
deviation is usually very nearly four-fifths of the standard
deviation. Thus for the distribution of pauperism we have

$$\frac{\text{mean deviation}}{\text{standard deviation}} = \frac{1.01}{1.24} = 0.81.$$

In the case of the distribution of male statures in the British
Isles, Example iii., the ratio found is 0.80. For a short series of
observations like the wage statistics of Example i. a regular result
could hardly be expected; the actual ratio is 15.0/20.5 = 0.73.

We pointed out in § 10 that in distributions of the simple forms
referred to, a range of six times the standard deviation contains
over 99 per cent. of all the observations. If the mean deviation
be employed as the measure of dispersion, we must substitute a
range of 7½ times this measure.

20. The Quartile Deviation or Semi-interquartile Range. If a
value $Q_1$ of the variable be determined of such magnitude that
one-quarter of all the values observed are less than $Q_1$ and
three-quarters greater, then $Q_1$ is termed the lower quartile. Similarly,
if a value $Q_3$ be determined such that three-quarters of all the
values observed are less than $Q_3$ and one-quarter only greater,
then $Q_3$ is termed the upper quartile. The two quartiles and the
median divide the observed values of the variable into four
classes of equal frequency. If $\mathit{Mi}$ be the value of the median, in
a symmetrical distribution

$$\mathit{Mi} - Q_1 = Q_3 - \mathit{Mi},$$

and the difference may be taken as a measure of dispersion. But
as no distribution is rigidly symmetrical, it is usual to take as the
measure

$$Q = \frac{Q_3 - Q_1}{2},$$

and $Q$ is termed the quartile deviation, or better, the
semi-interquartile range (it is not a measure of the deviation from
any particular average): the old name probable error should be
confined to the theory of sampling (Chap. XV. § 17).

21. In the case of a short series of ungrouped observations
the quartiles are determined, like the median, by inspection.
In the wage statistics of Example i., for instance, there are
38 observations, and 38/4 = 9.5. What is the lower quartile?
The student may be tempted to take it halfway between the
ninth and tenth observations from the bottom of the list;
but this would be wrong, for then there would be nine
observations only below the value chosen instead of 9.5. The
quartile must be taken as given by the tenth observation
itself, which may be regarded as divided by the quartile, and
falling half above it and half below. Therefore

Lower quartile $Q_1$ = 14s. 10d.
Upper quartile $Q_3$ = 16s. 11d.

and $Q = (Q_3 - Q_1)/2 = 12.5$d.



22. In the case of a grouped distribution, the quartiles, like
the median, are determined by simple arithmetical or by
graphical interpolation (cf. Chap. VII. §§ 15, 16). Thus for the
distribution of pauperism, Example ii., we have

    632/4                                    = 158
    Total frequency under 2.25 per cent.     = 138
                                 Difference  =  20
    Frequency in interval 2.25 - 2.75        =  89

Whence

$$Q_1 = 2.25 + \frac{20}{89} \times 0.5 = 2.362 \text{ per cent.}$$

Similarly we find $Q_3 = 4.130$ per cent. Hence

$$Q = \frac{Q_3 - Q_1}{2} = \frac{4.130 - 2.362}{2} = 0.884 \text{ per cent.}$$

It is left to the student to check the value by graphical
interpolation.
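
The arithmetical interpolation above may be sketched in Python as
follows. The function name `grouped_quantile` is hypothetical, and the
sketch assumes, as the text does, that the frequency of each
class-interval is spread evenly over the interval; the class limits are
those of Example ii. (0.75, 1.25, ... per cent.).

```python
lower_limits = [0.75 + 0.5 * i for i in range(16)]   # lower class limits
freq = [18, 48, 72, 89, 100, 90, 75, 60, 40, 21, 11, 5, 1, 1, 0, 1]
width = 0.5


def grouped_quantile(lower_limits, freq, width, p):
    """Value below which a fraction p of the total frequency lies,
    by simple interpolation within the class-interval concerned."""
    target = p * sum(freq)
    cumulative = 0
    for lo, f in zip(lower_limits, freq):
        if f > 0 and cumulative + f >= target:
            return lo + (target - cumulative) / f * width
        cumulative += f


q1 = grouped_quantile(lower_limits, freq, width, 0.25)
q3 = grouped_quantile(lower_limits, freq, width, 0.75)
Q = (q3 - q1) / 2                                     # semi-interquartile range

print(round(q1, 3), round(q3, 3), round(Q, 3))
```

With p = 0.5 the same function returns the median of the distribution.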

23. For distributions approaching the ideal forms of figs.
5 and 9, the semi-interquartile range is usually about two-thirds
of the standard deviation. Thus for Example ii. we find

$$\frac{Q}{\sigma} = \frac{0.884}{1.24} = 0.71.$$

The distribution of statures, Example iii., gives the ratio 0.68.
The short series of wage statistics in Example i. could not be
expected to give a result in very strict conformity with the
rule, but the actual ratio, viz. 0.61, does not diverge greatly.
It follows from this ratio that a range of nine times the
semi-interquartile range, approximately, is required to cover the same
proportion of the total frequency (99 per cent. or more) as a range
of six times the standard deviation.
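
The empirical ratios of four-fifths and two-thirds can be compared with
the values for one particular ideal form. The sketch below, in Python,
assumes the normal curve of errors as that form, an assumption which
the text does not make at this point; for that curve the ratio of the
mean deviation to the standard deviation is the square root of 2/π, and
the ratio of the semi-interquartile range to the standard deviation is
the upper quartile of the curve with unit standard deviation.

```python
from math import pi, sqrt
from statistics import NormalDist

md_over_sd = sqrt(2 / pi)                 # mean deviation / sigma, normal curve
q_over_sd = NormalDist().inv_cdf(0.75)    # semi-interquartile range / sigma

# Ranges equivalent to six times the standard deviation.
range_in_md = 6 / md_over_sd              # compare the 7 1/2 of the text
range_in_q = 6 / q_over_sd                # compare the "nine, approximately"

print(round(md_over_sd, 4), round(q_over_sd, 4))
print(round(range_in_md, 1), round(range_in_q, 1))
```

The ratios found for Examples ii. and iii. lie close to these values, as
would be expected of distributions that are only moderately asymmetrical.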

24. Of the three measures of dispersion, the semi-interquartile 
range has the most clear and simple meaning. It is calculated, 
like the median, with great ease, and the quartiles may be found, 
if necessary, by measuring two individuals only. If, e.g., the 
dispersion as well as the average stature of a group of men 
is required to be determined with the least possible expenditure 
of time, they may be simply ranked in order of height, and the 
three men picked out for measurement who stand in the centre
and one-quarter from either end of the rank. This measure of
dispersion may also be useful as a makeshift if the calculation
of the standard deviation has been rendered difficult or impossible
owing to the employment of an irregular classification of the
frequency or of an indefinite terminal class. Such uses are,
however, a little exceptional, and, generally speaking, the
semi-interquartile range as a measure of dispersion is not to be
recommended, unless simplicity of meaning is of primary
importance, owing to the lack of algebraical convenience which
it shares with the median. Further, it is obvious that the
quartile, like the median, may become indeterminate, and that
the use of this measure of dispersion is undesirable in cases of
discontinuous variation: the student should refer again to the
discussion of the similar disadvantage in the case of the median,
Chap. VII. § 14. It has, however, been largely used in the past,
particularly for anthropometric work.
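
The ranking procedure just described may be sketched as follows. This is an illustration only: the statures are hypothetical, and the rule for choosing which ranked individual stands "one-quarter from either end" is an assumed convention, other conventions giving slightly different values.

```python
def quartiles_by_ranking(values):
    """Return (lower quartile, median, upper quartile) by picking ranked individuals.

    Assumed convention: take the individuals standing one-quarter, one-half
    and three-quarters of the way along the rank (nearest position).
    """
    ranked = sorted(values)
    n = len(ranked)

    def pick(fraction):
        return ranked[round(fraction * (n - 1))]

    return pick(0.25), pick(0.5), pick(0.75)


# Hypothetical statures in inches, for illustration only.
statures = [64.2, 65.0, 65.9, 66.3, 66.8, 67.1, 67.4,
            67.9, 68.3, 68.8, 69.5, 70.1, 71.0]

q1, median, q3 = quartiles_by_ranking(statures)
semi_interquartile_range = (q3 - q1) / 2
print(q1, median, q3, semi_interquartile_range)
```

Only the three men so picked need be measured; the remainder are merely ranked.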

25. Measures of Relative Dispersion. As was pointed out in
Chapter VII. § 26, if relative size is regarded as influencing not only
the average, but also deviations from the average, the geometric
mean seems the natural form of average to use, and deviations
should be measured by their ratios to the geometric mean. As
already stated, however, this method of measuring deviations, with
its accompanying employment of the geometric mean, has never
come into general use. It is a much more simple matter to allow
for the influence of size by taking the ratio of the measure of
absolute dispersion (e.g. standard deviation, mean deviation, or
quartile deviation) to the average (mean or median) from which
the deviations were measured. Pearson has termed the quantity

$$v = 100\,\frac{\sigma}{M},$$

i.e. the percentage ratio of the standard deviation to the arithmetic
mean, the coefficient of variation (ref. 6), and has used it, for
example, in comparing the relative variations of corresponding
organs or characters in the two sexes: the ratio of the quartile
deviation to the median has also been suggested (Verschaeffelt,
ref. 7). Such a measure of relative dispersion is evidently a mere
number, and its magnitude is independent of the units of
measurement employed.
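
That the coefficient of variation is a mere number may be seen from a short sketch, in which the weights are hypothetical and the function name is introduced here for illustration. The standard deviation is taken about the mean with divisor N (`pstdev`).

```python
from statistics import fmean, pstdev


def coefficient_of_variation(values):
    """Percentage ratio of the standard deviation to the arithmetic mean."""
    return 100 * pstdev(values) / fmean(values)


# Hypothetical weights of five men, in pounds and again in kilogrammes.
weights_lb = [140.0, 152.0, 165.0, 171.0, 188.0]
weights_kg = [w * 0.45359237 for w in weights_lb]

v_lb = coefficient_of_variation(weights_lb)
v_kg = coefficient_of_variation(weights_kg)
print(round(v_lb, 3), round(v_kg, 3))
```

The two values agree, the change of unit affecting the standard deviation and the mean in the same proportion.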

26. Measures of Asymmetry or Skewness. If we have to compare
a series of distributions of varying degrees of asymmetry, or
skewness, as Pearson has termed it, some numerical measure of this
character is desirable. Such a measure of skewness should
obviously be independent of the units in which we measure the
variable (e.g. the skewness of the distribution of the weights of a
given set of men should not be dependent on our choice of the
pound, the stone, or the kilogramme as the unit of weight), and
the measure should accordingly be a mere number. Thus the
difference between the deviations of the two quartiles on either
side of the median indicates the existence of skewness, but to
measure the degree of skewness we should take the ratio of this
difference to some quantity of the same dimensions, e.g. the
semi-interquartile range. Our measure would then be, taking the
skewness to be positive if the longer tail of the distribution runs
in the direction of high values of X,

$$\text{skewness} = \frac{(Q_3 - \mathrm{Mi}) - (\mathrm{Mi} - Q_1)}{Q} = \frac{Q_1 + Q_3 - 2\,\mathrm{Mi}}{Q} \qquad (11)$$

This would not be a bad measure if we were using the quartile
deviation as a measure of dispersion: its lowest value is zero,
when the distribution is symmetrical; and while its highest possible
value is 2, it would rarely in practice attain higher numerical
values than ±1. A similar measure might be based on the mean
deviations in excess and in defect of the mean. There is, however,
only one generally recognised measure of skewness, and that is
Pearson's measure (ref. 8):

$$\text{skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \qquad (12)$$

This is evidently zero for a symmetrical distribution, in which
mode and mean coincide. No upper limit to the ratio is apparent
from the formula, but, as a fact, the value does not exceed unity for
frequency-distributions resembling generally the ideal distributions
of fig. 9. As the mode is a difficult form of average to determine
by elementary methods, it may be noted that the numerator of the
above fraction may, in the case of frequency-distributions of the
forms referred to, be replaced approximately by 3(mean - median)
(cf. Chap. VII. § 20). The measure (12) is much more sensitive
than (11) for moderate degrees of asymmetry.
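
The two measures may be set out as a short sketch. The function names and the numerical values are hypothetical, chosen only to exhibit the limits mentioned above; the third function uses the approximate substitution of 3(mean - median) for the numerator of (12).

```python
def quartile_skewness(q1, median, q3):
    """Measure (11): difference of the quartile deviations over Q."""
    q = (q3 - q1) / 2  # semi-interquartile range
    return ((q3 - median) - (median - q1)) / q


def pearson_skewness(mean, mode, sd):
    """Measure (12): (mean - mode) / standard deviation."""
    return (mean - mode) / sd


def pearson_skewness_from_median(mean, median, sd):
    """Measure (12) with the numerator replaced by 3(mean - median)."""
    return 3 * (mean - median) / sd


symmetrical = quartile_skewness(2.0, 3.0, 4.0)  # quartiles equidistant from median
moderate = quartile_skewness(2.0, 3.0, 5.0)     # longer tail towards high values
extreme = quartile_skewness(2.0, 2.0, 5.0)      # median coinciding with lower quartile

print(symmetrical, round(moderate, 4), extreme)
print(pearson_skewness(10.0, 9.4, 2.0), pearson_skewness_from_median(10.0, 9.8, 2.0))
```

The extreme case, in which the median coincides with one quartile, gives the limiting value 2 of measure (11).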

27. The Method of Percentiles. We may conclude this chapter
by describing briefly a method that has been largely used in the
past in lieu of the methods dealt with in Chapters VI. and VII.,
and the preceding paragraphs of this chapter, for summarising
such statistics as we have been considering. If the values of the
variable (variates, as they are sometimes termed) be ranged in
order of magnitude, and a value P of the variable be determined
such that a percentage p of the total frequency lies below it and
100 - p above, then P is termed a percentile. If a series of
percentiles be determined for short intervals, e.g. 5 per cent. or 10
per cent., they suffice by themselves to show the general form
of the distribution. This is Sir Francis Galton's method of
percentiles. The deciles, or values of the variable which divide
the total frequency into ten equal parts, form a natural and
convenient series of percentiles to use. The fifth decile, or value
of the variable which has 50 per cent. of the observed values
above it and 50 per cent. below, is the median: the two quartiles
lie between the second and third and the seventh and eighth
deciles respectively.

28. The deciles, like the median and quartiles, may be
determined either by arithmetical or by graphical interpolation,
excluding the cases in which, like the former constants, they
become indeterminate (cf. § 24). It is hardly necessary to give
an illustration of the former process, as the method is
the same as for median and quartiles (Chap. VII. …,
§ 22). Fig. 26 shows, of course on a very much reduced scale, the
curve used for obtaining the deciles by the graphical method in
the case of the distribution of pauperism (Example ii. above).
The figures of the original table are added up step by step from
the top, so as to give the total frequency not exceeding the upper
limit of each class-interval, and ordinates are then erected to a
horizontal base to represent on some scale these integrated
frequencies: a smooth curve is then drawn through the tops of
the ordinates so obtained. This curve, as will be seen from the
figure, rises slowly at first when the frequencies are small, then
more rapidly as they increase, and finally turns over again and
becomes quite flat as the frequencies tail off to zero. The deciles
may be readily obtained from such a curve by dividing the
terminal ordinate into ten equal parts, and projecting the points
so obtained horizontally across to the curve and then vertically
down to the base. The construction is indicated on the figure for
the fourth decile, the value of which is approximately 2.88 per cent.

Fig. 26. Curve showing the number of Districts of England and Wales in
which the Pauperism on 1st January 1891 did not exceed any given
percentage (… Fig. 10, p. 92): graphical …

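
The construction may be imitated arithmetically. In the sketch below the frequencies are summed step by step, and the decile is read off by straight-line interpolation between successive ordinates; this replaces the smooth curve of the text by a polygon, so the results are only an approximation to the graphical method. The grouped table is hypothetical and is not the pauperism distribution of Example ii.

```python
from itertools import accumulate


def percentile_from_grouped(upper_limits, frequencies, lower_limit, p):
    """Value of the variable below which a percentage p of the frequency lies.

    upper_limits[i] is the upper limit of class i; lower_limit is the lower
    limit of the first class. Interpolation within a class is linear.
    """
    cumulative = list(accumulate(frequencies))   # integrated frequencies
    target = cumulative[-1] * p / 100            # height on the terminal ordinate
    previous_limit, previous_cum = lower_limit, 0
    for limit, cum in zip(upper_limits, cumulative):
        if cum > previous_cum and target <= cum:
            fraction = (target - previous_cum) / (cum - previous_cum)
            return previous_limit + fraction * (limit - previous_limit)
        previous_limit, previous_cum = limit, cum
    return upper_limits[-1]


# Hypothetical grouped distribution: classes 0-1, 1-2, 2-3, 3-4, 4-5.
upper_limits = [1.0, 2.0, 3.0, 4.0, 5.0]
frequencies = [10, 30, 40, 15, 5]

deciles = [percentile_from_grouped(upper_limits, frequencies, 0.0, p)
           for p in range(10, 100, 10)]
fourth_decile = deciles[3]
median = deciles[4]
print([round(d, 3) for d in deciles])
```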
29. The curve of fig. 26 may be drawn in a different way by
taking a horizontal base divided into ten or a hundred equal
parts (grades, as Sir Francis Galton has termed them), and erecting
at each point so obtained a vertical proportional to the
corresponding percentile. This gives the curve of fig. 27, which was
obtained by merely redrafting fig. 26. The curve is of so-called
ogive form. The ogive curve for the distribution of statures
(Example iii.) is shown for comparison in fig. 28. It will be noticed
that the ogive curve does not bring out the asymmetry of the
distribution of pauperism nearly so clearly as the frequency-
polygon, fig. 10, p. 92.

30. The method of percentiles has some advantages as a method
of representation, as the meaning of the various percentiles is so
simple and readily understood. An extension of the method to
the treatment of non-measurable characters has also become of
some importance. For example, the capacity of the different boys
in a class as regards some school subject cannot be directly
measured, but it may not be very difficult for the master to
arrange them in order of merit as regards this character: if the
boys are then "numbered up" in order, the number of each boy,
or his rank, serves as some sort of index to his capacity (cf. the
remarks in § 12. It should be noted that rank in this sense is
not quite the same as grade; if a boy is tenth, say, from the
bottom in a class of a hundred his grade is 9.5, but the method
is in principle the same with that of grades or percentiles).
The method of ranks, grades, or percentiles in such a case may
be a very serviceable auxiliary, though, of course, it is better if
possible to obtain a numerical measure. But if, in the case of a
measurable character, the percentiles are used not merely as
constants illustrative of certain aspects of the frequency-
distribution, but entirely to replace the table giving the frequency-
distribution, serious inconvenience may be caused, as the
application of other methods to the data is barred. Given the
table showing the frequency-distribution, the reader can calculate
not only the percentiles, but any form of average or measure of
dispersion that has yet been proposed, to a sufficiently high
degree of approximation. But given only the percentiles, or at
least so few of them as the nine deciles, he cannot pass back to
the frequency-distribution, and thence to other constants, with any
degree of accuracy. In all cases of published work, therefore,
the figures of the frequency-distribution should be given; they
are absolutely fundamental.

Fig. 28. Ogive Curve for Stature, same data as Fig. …



REFERENCES.

(1) Fechner, G. T., "Ueber den Ausgangswerth der kleinsten
Abweichungssumme, dessen Bestimmung, Verwendung und
Verallgemeinerung," Abh. d. kgl. sächs. Ges. d. Wissenschaften, vol. xviii.
(also numbered vol. xi. of the Abh. d. math.-phys. Classe); Leipzig,
1878, p. 1.

Standard Deviation.

(2) Pearson, Karl, "Contributions to the Mathematical Theory of Evolution
(I. On the Dissection of Asymmetrical Frequency-curves)," Phil. Trans.
Roy. Soc., Series A, vol. clxxxv., 1894, p. 71. (Introduction of the term
"standard deviation," p. 80.)

Mean Deviation.

(3) Laplace, Pierre Simon, Marquis de, Théorie analytique des
probabilités: 2me supplément, 1818. (Proof that the mean deviation is a
minimum when taken about the median.)

Method of Percentiles, including Quartiles, etc.

(4) Galton, Francis, "Statistics by Intercomparison, with Remarks on the
Law of Frequency of Error," Phil. Mag., vol. xlix. (4th Series), 1875,
pp. 33-46.

(5) Galton, Francis, Natural Inheritance; Macmillan, 1889. (The method
of percentiles is used throughout, with the quartile deviation as the
measure of dispersion.)

Relative Dispersion.

(6) Pearson, Karl, "Regression, Heredity, and Panmixia," Phil. Trans.
Roy. Soc., Series A, vol. clxxxvii., 1896, p. 253. (Introduction of
"coefficient of variation," pp. 276-7.)

(7) Verschaeffelt, E., "Ueber graduelle Variabilität von pflanzlichen
Eigenschaften," Ber. deutsch. bot. Ges., Bd. xii., 1894, pp. 350-55.

Skewness.

(8) Pearson, Karl, "Skew Variation in Homogeneous Material," Phil.
Trans. Roy. Soc., Series A, vol. clxxxvi., 1895, p. 343. (Introduction
of term, p. 370.)

Calculation of Mean, Standard-deviation, or of the General
Moments of a Grouped Distribution.

We have given a direct method that seems the simplest and best for
the elementary student. A process of successive summation that has
some advantages can, however, be used instead. The student will
find a convenient description with illustrations in:

(9) Elderton, W. Palin, Frequency-curves and Correlation; C. & E.
Layton, London, 1906.



Stature in Inches for Adult Males born in:

                                         England.  Scotland.  Wales.   Ireland.
Standard deviation                         2.56      2.50      2.35     2.17
Mean deviation                             2.05      1.95      1.82     1.6?
Quartile deviation                         1.78      1.56      1.4?     1.3?
Mean deviation / standard deviation        0.80      0.78      0.78     0.78
Quartile deviation / standard deviation    0.69      0.62      0.62     0.62
Lower quartile                             …         66.92     65.06    66.39
Upper quartile                             69.10     70.04     67.9?    69.10


2. (Continuing from Qu. 2, Chap. VII.) Find the standard deviation,
mean deviation, quartiles and quartile deviation (or semi-interquartile range)
for the distribution of weights of adult males in the United Kingdom given in
the last column of Table IX., Chap. VI.

Compare the ratios of the mean and quartile deviations to the standard
deviation with the ratios stated in §§ 19 and 23 to be usual.

Find the value of the skewness (equation 12), using the approximate value
of the mode.

3. Using, or extending if necessary, your diagram for Question 4, Chap. VII.,
find the quartile values for houses assessed to inhabited house duty in 1885-6,
from the data of Table IV., Chap. VI.

Find also the 9th decile (the value exceeded by 10 per cent. of the houses).

4. Verify equation (9) by direct calculation of the standard deviation of the
numbers 1 to 10.

5. (Data from Sauerbeck, Jour. Roy. Stat. Soc., March 1909.) The
following are the index-numbers (percentages) of prices of 45 commodities in
1908 on their average prices in the years 1867-77: 40, 43, 43, 46, 46, 46,
54, 56, 59, 62, 64, 64, 66, 66, 67, 67, 68, 68, 69, 69, 69, 71, 75, 75, 76, 76,
78, 80, 82, 82, 82, 82, 82, 83, 84, 86, 88, 90, 90, 91, 91, 92, 95, 102, 127.
Find the mean and standard deviation (1) without further grouping; (2)
grouping the numbers by fives (40-, 45-, 50-, etc.); (3) grouping by tens (40-,
50-, 60-, etc.).

6. (Continuing from Qu. 8, Chap. VII.) Supposing the frequencies of
values 0, 1, 2, 3, . . . of a variable to be given by the terms of the binomial
series

$$q^{n},\quad n\,q^{n-1}p,\quad \frac{n(n-1)}{1\cdot 2}\,q^{n-2}p^{2},\ \ldots$$

where p + q = 1, find the standard deviation.

7. (Cf. the remarks at the end of § 17.) The sum of the deviations
(without regard to sign) about the centre of the class-interval containing
the mean (or median), in a grouped frequency-distribution, is found to be S.
Find the correction to be applied to this sum, in order to reduce it to the
mean (or median) as origin, on the assumption that the observations are
evenly distributed over each class-interval. Take the number of observations
below the interval containing the mean (or median) to be n1, in that interval
n2, and above it n3; and the distance of the mean (or median) from the
arbitrary …

Show that the values of the mean deviation (from the mean and from the
median respectively) for Example ii., found by the use of this formula, do not
differ from the values found by the simpler method of §§ 16 and 17 in the
second place of decimals.

8. (W. Scheibner, "Ueber Mittelwerthe," Berichte der kgl. sächsischen
Gesellschaft d. Wissenschaften, 1873, p. 584, cited by Fechner, ref. 2 of
Chap. VII.: the second form of the relation is given by G. Duncker (Die
Methode der Variationsstatistik; Leipzig, 1899) as an empirical one.) Show
that if deviations are small compared with the mean, so that $(x/M)^3$ may be
neglected in comparison with $(x/M)^2$, we have approximately the relation

$$G = M\left(1 - \frac{1}{2}\,\frac{\sigma^{2}}{M^{2}}\right),$$

where G is the geometric mean, M the arithmetic mean, and σ the standard
deviation: and consequently to the same degree of approximation
$M^{2} - G^{2} = \sigma^{2}$.

9. (Scheibner, loc. cit., Qu. 8.) Similarly, show that if deviations are small
compared with the mean, we have approximately

$$H = M\left(1 - \frac{\sigma^{2}}{M^{2}}\right),$$

H being the harmonic mean.




CHAPTER IX. 

CORRELATION.

1-3. The correlation table and its formation; 4-5. The correlation surface;
6-7. The general problem; 8-9. The line of means of rows and the
line of means of columns: their relative positions in the case of
independence and of varying degrees of correlation; 10-14. The
correlation coefficient and the regressions; 15-16. Numerical
calculations; 17. Certain points to be remembered in calculating and using
the coefficient.

1. In Chapters VI.-VIII. we considered the frequency-distribution
of a single variable, and the more important constants
that may be calculated to describe certain characters of such
distributions. We have now to proceed to the case of two
variables, and the consideration of the relations between them.

2. If the corresponding values of two variables be noted
together, the methods of classification employed in the preceding
chapters may be applied to both, and a table of double entry or
contingency-table (Chap. V.) be formed, exhibiting the frequencies
of pairs of values lying within given class-intervals. Six such
tables are given below as illustrations for the following
variables: Table I., two measurements on a shell (Pecten).
Table II., ages of husbands and wives in England and Wales in
1901. Table III., statures of fathers and their sons (British).
Table IV., fertility of mothers and their daughters (British
peerage). Table V., the rate of discount and the ratio of reserves
to deposits in American banks. Table VI., the proportion of
male to total births, and the total numbers of births, in the
registration districts of England and Wales.

Each row in such a table gives the frequency-distribution of
the first variable for cases in which the second variable lies
within the limits stated on the left of the row. Similarly, every
column gives the frequency-distribution of the second variable
for cases in which the value of the first variable lies within the
limits stated at the head of the column. As "columns" and
"rows" are distinguished only by the accidental circumstance
of the one set running vertically and the other horizontally, and
the difference has no statistical significance, the word array
has been suggested as a convenient term to denote either a row
or a column. If the values of X in one array are associated
with values of Y between the limits $Y_n - \delta$ and $Y_n + \delta$, $Y_n$ may be
termed the type of the array. (Pearson, ref. 6.) The special
kind of contingency tables with which we are now concerned
are called correlation tables, to distinguish them from tables
based on unmeasured qualities and so forth.

[Table I. Two measurements on a shell (Pecten). (2) Dorso-ventral diameter.]

[Table II. Ages of husbands and wives in England and Wales in 1901. (2) Ages of Husbands.]

[Table III. Statures of fathers and their sons (British). (2) Stature of Son.]

[Table IV. Fertility of mothers and their daughters (British peerage). (2) Number of Daughter's Children.]

[Table V. Rate of discount and ratio of reserves to deposits in American banks. (2) Percentage Ratio of Reserves to Deposits.]

[Table VI. Proportion of male to total births, and total numbers of births, in the registration districts of England and Wales. (2) … of Births in District (000's omitted) during …]

3. Nothing need be added to what was said in Chapter VI. as
regards the choice of magnitude and position of class-intervals.
When these have been fixed, the table is readily compiled by
taking a large sheet ruled with rows and columns properly
headed in the same way as the final table and entering a dot,
stroke, or small cross in the corresponding compartment for each
pair of recorded observations. If facility of checking be of
great importance, each pair of recorded values may be entered
on a separate card and these dealt into little packs on a board
ruled in squares, or into a divided tray; each pack can then be
run through to see that no card has been mis-sorted. The
difficulty as to the intermediate observations (values of the
variables corresponding to divisions between class-intervals) will
be met in the same way as before if the value of one variable
alone be intermediate, the unit of frequency being divided
between two adjacent compartments. If both values of the pair
be intermediate, the observation must be divided between four
adjacent compartments, and thus quarters as well as halves may
occur in the table, as, e.g., in Table III. In this case the statures
of fathers and sons were measured to the nearest quarter-
inch and subsequently grouped by 1-inch intervals: a pair in
which the recorded stature of the father is 60.5 in. and that of
the son 62.5 in. is accordingly entered as 0.25 to each of the
four compartments under the columns 59.5-60.5, 60.5-61.5, and
the rows 61.5-62.5, 62.5-63.5. Workers will generally form
their own methods for entering such fractional frequencies
during the process of compiling, but one convenient method is
to use a small x to denote a unit and a dot for a quarter; the
four dots should be placed in the position of the four points
of the x and joined when complete. It is best to choose the
limits of class-intervals, where possible, in such a way as to avoid
fractional frequencies.
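
The rule for intermediate observations may be put in the form of a short sketch. The function names, the indexing of class-intervals from an assumed origin, and the use of exact fractions are all choices made here for illustration, not a prescribed method.

```python
from collections import defaultdict
from fractions import Fraction
import math


def classes_for(value, origin, width):
    """Indices of the class-intervals to which an observation is entered.

    Class k runs from origin + k*width to origin + (k+1)*width. A value
    falling on a division between two intervals is shared between them.
    """
    position = (Fraction(str(value)) - Fraction(str(origin))) / Fraction(str(width))
    if position.denominator == 1:            # exactly on a class boundary
        return [int(position) - 1, int(position)]
    return [math.floor(position)]


def correlation_table(pairs, x_origin, y_origin, width=1):
    """Compile frequencies keyed by (column index, row index)."""
    table = defaultdict(Fraction)
    for x, y in pairs:
        x_classes = classes_for(x, x_origin, width)
        y_classes = classes_for(y, y_origin, width)
        share = Fraction(1, len(x_classes) * len(y_classes))
        for i in x_classes:
            for j in y_classes:
                table[(i, j)] += share
    return table


# The pair quoted in the text: father 60.5 in., son 62.5 in.,
# with columns beginning at 59.5 and rows at 61.5 (1-inch intervals).
table = correlation_table([(60.5, 62.5)], x_origin=59.5, y_origin=61.5)
for compartment, frequency in sorted(table.items()):
    print(compartment, frequency)
```

The single pair is entered as one quarter in each of four adjacent compartments, as in Table III.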

4. The distribution of frequency for two variables may be
represented by a surface or solid in the same way as the frequency-
distribution of a single variable may be represented by a plane
figure. We may imagine the surface to be obtained by erecting
at the centre of every compartment of the correlation-table a
vertical of length proportionate to the frequency in that
compartment, and joining up the tops of the verticals. If the
compartments were made smaller and smaller while the class-
frequencies remained finite, the irregular figure so obtained would
approximate more and more closely towards a continuous curved
surface, a frequency-surface, corresponding to the frequency-
curves for single variables of Chapter VI. The volume of the
frequency-solid over any area drawn on its base gives the
frequency of pairs of values falling within that area, just as the
area of the frequency-curve over any interval of the base-line gives
the frequency of observations within that interval. Models of
actual distributions may be constructed by drawing the frequency-
distributions for all arrays of the one variable, to the same scale,
on sheets of cardboard, and erecting the cards vertically on a
base-board at equal distances apart, or by marking out a base-
board in squares corresponding to the compartments of the
correlation-table, and erecting on each square a rod of wood of
height proportionate to the frequency. Such solid representations
of frequency-distributions for two variables are sometimes termed
stereograms.

5. It is impossible, however, to group the majority of
frequency-surfaces, in the same way as the frequency-curves,
under a few simple types; the forms are too varied. The simplest
ideal type is one in which every section of the surface is a
symmetrical curve, the first type of Chap. VI. (fig. 5, p. 89). Like
the symmetrical distribution for the single variable, this is a very
rare form of distribution in economic statistics, but approximate
illustrations may be drawn from anthropometry. Fig. 29 shows
the ideal form of the surface, somewhat truncated, and fig.
30 the distribution of Table III., which approximates to the same
type; the difference in steepness is, of course, merely a matter of
scale. The maximum frequency occurs in the centre of the
whole distribution, and the surface is symmetrical round the
vertical through the maximum, equal frequencies occurring at
equal distances from the mode on opposite sides. The next
simplest type of surface corresponds to the second type of
frequency-curve, the moderately asymmetrical. Most, if not all,
of the distributions of arrays are asymmetrical, and like the
distribution of fig. 9, p. 92: the surface is consequently asymmetrical,
and the maximum does not lie in the centre of the distribution.
This form is fairly common, and illustrations might be drawn
from a variety of sources: economics, meteorology, anthropometry,
etc. The data of Table II. will serve as an example. The total
distributions and the distributions of the majority of the arrays
are asymmetrical, the skewness being positive for the rows at
the top of the table (the mode being lower than the mean), and
negative for the rows at the foot, the more central rows being
nearly symmetrical. The maximum frequency lies towards the
upper end of the table in the compartment under the row and
column headed "30-". The frequency falls off very rapidly
towards the lower ages, and slowly in the direction of old age.
Outside these two forms, it seems impossible to delimit empirically
any simple types. Tables V. and VI. are given simply as
illustrations of two very divergent forms. Fig. 31 gives a graphical
representation of the former by the method corresponding to the
histogram of Chapter VI., the frequency in each compartment
being represented by a square pillar. The distribution of
frequency is very characteristic, and quite different from that
of any of the Tables I., II., III., or IV.

6. It is clear that such tables may be treated by any of the
methods discussed in Chapter V., which are applicable to all
contingency-tables, however formed. The distribution may be
investigated in detail by such methods as those of § 4, or tested
for isotropy (§ 11), or the coefficient of contingency can be
calculated (§§ 5-8). In applying any of these methods, however,
it is desirable to use a coarser classification than is suited to the
methods to be presently discussed, and it is not necessary to
retain the constancy of the class-interval. The classification
should, on the contrary, be arranged simply with a view to avoiding
many scattered units or very small frequencies. A few examples
should be worked as exercises by the student (Question 3).

7. But the coefficient of contingency merely tells us whether, 
and if so, how closely, the two variables are related, and much 
more information than this can be obtained from the correlation- 
table, seeing that the measures of Chapters VII. and VIII. can be 
applied to the arrays as well aa to the total distributions. If the 
two variables are independent, the distributions of all parallel 
arrays are similar (Chap. V. | 13); hence their averages and 
dispersions, e.g. means and standard deviations, must be the same. 
In general they are not the same, and the relation between the 
mean or standard deviation of the array and its type requires 
investigation. Of the two constants, the mean is, in general, the 
more important, and our attention will for the present be con- 
fined to it. The majority of the questions of practical statistics 
relate solely to averages : the most important and fundamental 
question is whether, on an average, high values of the one variable
show any tendency to be associated with high (or with low)
values of the other. If possible, we also desire to know how great a
divergence of the one variable from its average value is associated
with a unit divergence of the other, and to obtain some idea as to
the closeness with which this relation is usually fulfilled.

8. Suppose a diagram (fig. 32) to be drawn representing the
values of means of arrays. Let OX, OY be the scales of the two
variables, i.e. the scales at the head and side of the table, 01, 12,
etc., being successive class-intervals. Let $M_1$ be the mean value
of X, and $M_2$ the mean value of Y. If the two variables be
absolutely independent, the distributions of frequency in all
parallel arrays are similar (Chap. V. § 13), and the means of arrays
must lie on the vertical and horizontal lines $M_1M$, $M_2M$, the
small circles denoting means of rows and the small crosses means
of columns. (In any actual case, of course, the means would not
lie so regularly, but, if the independence were almost complete,
would only fluctuate slightly to the one side and the other of the
two lines.)

The cases with which the experimentalist, e.g. the chemist or
physicist, has to deal, where the observations are all crowded
closely round a single line, lie at the opposite extreme from
independence. The entries fall into a few compartments only of
each array, and the means of rows and of columns lie approximately
on one and the same curve, like the line RR of fig. 33.

The ordinary cases of statistics are intermediate between these 
two extremes, the lines of means being neither at right angles as
in fig. 32, nor coincident as in fig. 33, but standing at an acute
angle with one another as RR (means of rows) and CC (means of
columns) in figs. 36-8. The complete problem of the statistician,
like that of the physicist, is to find formulae or equations which
will suffice to describe approximately these curves.
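
The means of arrays described above can be computed directly from a correlation-table. The following is a minimal Python sketch and is not part of the original text; the function name `array_means` and the small table are hypothetical, the table being chosen so that every row has the same shape and the two variables are therefore independent.

```python
def array_means(table, x_types, y_types):
    """Return (means of rows, means of columns) for a frequency table.

    table[i][j] is the frequency in the row of type y_types[i]
    and the column of type x_types[j].
    """
    row_means = []
    for row in table:
        n = sum(row)
        # mean value of X among the observations of this row
        row_means.append(sum(f * x for f, x in zip(row, x_types)) / n)
    col_means = []
    for j in range(len(x_types)):
        col = [row[j] for row in table]
        n = sum(col)
        # mean value of Y among the observations of this column
        col_means.append(sum(f * y for f, y in zip(col, y_types)) / n)
    return row_means, col_means


# Hypothetical table: all rows are proportional, so X and Y are independent
independent = [[1, 2, 1],
               [2, 4, 2],
               [1, 2, 1]]
rows, cols = array_means(independent, [0, 1, 2], [0, 1, 2])
print(rows, cols)
```

Under these assumptions every row has the same mean and every column has the same mean, so the means of rows lie on one vertical line and the means of columns on one horizontal line.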

9. In the general case this may be a difficult problem, but, in 
the first place, it often suffices, as already pointed out, to know 
merely whether on an average high values of the one variable
show any tendency to be associated with high or with low values
of the other, a purpose which will be served very fairly by fitting a
straight line; and further, in a large number of cases, it is found
either (1) that the means of arrays lie very approximately round 
straight lines, or (2) that they lie so irregularly (possibly owing 
only to paucity of observations) that the real nature of the curve 
is not clearly indicated, and a straight line will do almost as well 
as any more elaborate curve. (Cf. figs. 36-38.) In such cases 
— and they are relatively more frequent than might be supposed 
— the fitting of straight lines to the means of arrays determines 
all the most important characters of the distribution. We might 
fit such lines by a simple graphical method, plotting the points 
representing means of arrays on a diagram like those of figures
36-38, and "fitting" lines to them, say, by means of a stretched
black thread shifted about till it appeared to run as near as
might be to all the points. But such a method is hardly satisfactory,
more especially if the points are somewhat scattered; it
leaves too much room for guesswork, and different observers obtain 
very different results. Some method is clearly required which 
will enable the observer to determine equations to the two lines 
for a given distribution, however irregularly the means may lie, 
as simply and definitely as he can calculate the means and 
standard deviations. 

10. Consider the simplest case in which the means of rows lie
exactly on a straight line RR (fig. 34). Let $M_2$ be the mean
value of Y, and let RR cut the horizontal through $M_2$ in M.
Then it may be shown that the vertical through M must cut OX
in $M_1$, the mean of X. For, let the slope of RR to the vertical,
i.e. the tangent of the angle $M_1MR$ or ratio of kl to lM, be $b_1$,
and let deviations from My, Mx be denoted by x and y. Then for
any one row of type y in which the number of observations is n,
$\Sigma(x) = n\,b_1 y$, and therefore for the whole table, since $\Sigma(ny) = 0$,
$\Sigma(x) = b_1\Sigma(ny) = 0$. $M_1$ must therefore be the mean of X, and
M may accordingly be termed the mean of the whole distribution.
Knowing that RR passes through M, it remains only to determine
$b_1$. This may conveniently be done in terms of the mean product
p of all pairs of associated deviations x and y, i.e.

$$p = \frac{1}{N}\Sigma(xy) \qquad (1)$$

For any one row we have

$$\Sigma(xy) = y\,\Sigma(x) = n\,b_1 y^2 .$$

Therefore for the whole table

$$\Sigma(xy) = b_1\Sigma(ny^2) = N\,b_1\sigma_y^2 ,$$

or

$$b_1 = \frac{p}{\sigma_y^2} \qquad (2)$$

Similarly, if CC be the line on which lie the means of columns
and $b_2$ its slope to the horizontal, rs/sM,

$$b_2 = \frac{p}{\sigma_x^2} \qquad (3)$$

These two equations (2) and (3) are usually written in a
slightly different form. Let

$$r = \frac{p}{\sigma_x\sigma_y} \qquad (4)$$

Then

$$b_1 = r\frac{\sigma_x}{\sigma_y}, \qquad b_2 = r\frac{\sigma_y}{\sigma_x} \qquad (5)$$

Or we may write the equations to RR and CC

$$x = r\frac{\sigma_x}{\sigma_y}\,y, \qquad y = r\frac{\sigma_y}{\sigma_x}\,x \qquad (6)$$

These equations may, of course, be expressed, if desired, in
terms of the absolute values of the variables X and Y instead of
the deviations x and y.
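
The quantities defined in equations (1) to (5) can be computed in a few lines for ungrouped data. The following Python sketch is an illustration only; the function name `correlation_constants` and the five pairs of values are hypothetical, and the standard deviations are taken with divisor N, as in the text.

```python
from math import sqrt


def correlation_constants(X, Y):
    """Return p, sigma_x, sigma_y, b1, b2 and r for paired observations."""
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(Y) / n
    x = [v - mean_x for v in X]               # deviations from the mean of X
    y = [v - mean_y for v in Y]               # deviations from the mean of Y
    p = sum(a * b for a, b in zip(x, y)) / n  # mean product, equation (1)
    var_x = sum(a * a for a in x) / n
    var_y = sum(b * b for b in y) / n
    return {
        "p": p,
        "sigma_x": sqrt(var_x),
        "sigma_y": sqrt(var_y),
        "b1": p / var_y,                      # equation (2)
        "b2": p / var_x,                      # equation (3)
        "r": p / sqrt(var_x * var_y),         # equation (4)
    }


# Hypothetical data
consts = correlation_constants([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(consts)
```

For these values the mean product is 1.6 and both variances are 2, so that r, $b_1$ and $b_2$ all come to 0.8, in agreement with equation (5).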

11. The meaning of the above expressions when the means of
rows and columns do not lie exactly on straight lines is very
readily obtained. If the values of x and $b_1 y$ be noted for all
pairs of associated deviations, we have for the sum of the
squares of the differences, giving $b_1$ its value from (5),

$$\Sigma(x - b_1 y)^2 = N\sigma_x^2(1 - r^2) \qquad (7)$$

If $b_1$ be given any other value, say $(r + \delta)\,\sigma_x/\sigma_y$, then

$$\Sigma(x - b_1 y)^2 = N\sigma_x^2(1 - r^2 + \delta^2).$$

This is necessarily greater than the value (7); hence $\Sigma(x - b_1 y)^2$
has the lowest possible value when $b_1$ is put equal to $r\sigma_x/\sigma_y$.
Further, if for any one row the number of observations
is n, the deviation of the mean of the row from RR is d (fig. 35),
and the standard deviation is $s_x$, then $\Sigma(x - b_1 y)^2 = n s_x^2 + n d^2$.
Therefore for the whole table,

$$\Sigma(x - b_1 y)^2 = \Sigma(n s_x^2) + \Sigma(n d^2).$$

But the first of the two sums on the right is unaffected by the
slope or position of RR, hence, the left-hand side being a
minimum, the second sum on the right must be a minimum also.
That is to say, when $b_1$ is put equal to $r\sigma_x/\sigma_y$ the sum of the squares
of the distances of the row-means from RR, each multiplied by the
corresponding frequency, is the lowest possible.

Similar theorems hold good, of course, with respect to the line
CC. If $b_2$ be given the value $r\sigma_y/\sigma_x$, $\Sigma(y - b_2 x)^2$ is a minimum,
and also $\Sigma(n e^2)$ (fig. 35). Hence we may regard the equations (6)
as being, either (a) equations for estimating each individual x
from its associated y (and y from its associated x) in such a way
as to make the sum of the squares of the errors of estimate the
least possible; or (b) equations for estimating the mean of the x's
associated with a given type of y (and the mean of the y's associated
with a given type of x) in such a way as to make the sum of the
squares of the errors of estimate the least possible, when every
mean is counted once for each observation on which it is based.





Fig. 36. Correlation between Age of Husband and Age of Wife in England
and Wales (Table II.): means of rows shown by circles and means of
columns by crosses: r = +0.91.

The lines represented by the two equations are thus, in a certain 
natural sense, "lines of best fit" to the two actual lines of means.
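
The minimum property expressed by equation (7) can be checked numerically. The sketch below is illustrative only: the deviations are hypothetical, and the helper `sum_sq` is a name introduced here. For several values of $\delta$ it compares $\Sigma(x - b_1 y)^2$, with $b_1 = (r + \delta)\sigma_x/\sigma_y$, against $N\sigma_x^2(1 - r^2 + \delta^2)$.

```python
def sum_sq(x, y, b):
    """Sum of the squares of the differences x - b*y over all pairs."""
    return sum((a - b * c) ** 2 for a, c in zip(x, y))


# Hypothetical deviations from the means (each list sums to zero)
x = [-2, -1, 0, 1, 2]
y = [-1, -2, 1, 0, 2]
N = len(x)

var_x = sum(a * a for a in x) / N
var_y = sum(c * c for c in y) / N
p = sum(a * c for a, c in zip(x, y)) / N
r = p / (var_x * var_y) ** 0.5
ratio = (var_x / var_y) ** 0.5          # sigma_x / sigma_y

results = []
for delta in (-0.2, 0.0, 0.1):
    b = (r + delta) * ratio
    lhs = sum_sq(x, y, b)                             # direct sum of squares
    rhs = N * var_x * (1 - r * r + delta * delta)     # value given in the text
    results.append((delta, lhs, rhs))
    print(delta, round(lhs, 6), round(rhs, 6))
```

The two columns agree for every $\delta$, and the smallest sum of squares is the one for $\delta = 0$.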

12. The constant r is of very great importance. It is evidently
a pure number, and its magnitude is unaffected by the
scales in which x and y are measured, for these scales will
affect the numerator and denominator of (4) to the same
extent. If the two variables are independent, r is zero, for $b_1$
and $b_2$ are zero (cf. § 8). The sign is the sign of the mean
product p, and accordingly r is positive if large values of x
are associated with large values of y, and conversely (as in
Tables I.-IV.), negative if small values of x are associated with
large values of y, and conversely (as in Table V.). The numerical
value cannot exceed ±1, for the sum of the series of squares
in equation (7) is then zero and the sum of a series of squares
cannot be negative. If r = ±1, it follows that all the observed
pairs of deviations are subject to the relation $x/y = \pm\sigma_x/\sigma_y$: this
would be the case if the circles and crosses in such a diagram as
fig. 33 all lay on one and the same straight line. From these
properties r is termed the coefficient of correlation, and the
expression (4), $r = p/\sigma_x\sigma_y = \Sigma(xy)/N\sigma_x\sigma_y$, should be remembered.

It should be noted that, while r is zero if the variables are 
independent, the converse is not necessarily true: the fact that 
r is zero only implies that the means of rows and columns 
lie scattered round two straight lines which do not exhibit
any definite trend, to right or to left, upward or downward.
Two variables for which r is zero are, however, conveniently
spoken of as uncorrelated. Table VI. and fig. 39 will serve as an
illustration of a case in which the variables are almost uncorrelated
but by no means independent, r being very small (-0.014),
but the coefficient of contingency C 0.47.
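
A small constructed case shows how r may vanish although one variable is completely determined by the other. The Python sketch below is an illustration and not an example from the text; the data are hypothetical, with Y taken as the square of X and the values of X placed symmetrically about zero.

```python
def correlation(X, Y):
    """Coefficient of correlation r = p / (sigma_x * sigma_y)."""
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(Y) / n
    p = sum((a - mean_x) * (b - mean_y) for a, b in zip(X, Y)) / n
    sigma_x = (sum((a - mean_x) ** 2 for a in X) / n) ** 0.5
    sigma_y = (sum((b - mean_y) ** 2 for b in Y) / n) ** 0.5
    return p / (sigma_x * sigma_y)


# Hypothetical data: Y depends wholly on X, yet shows no linear trend
X = [-2, -1, 0, 1, 2]
Y = [v * v for v in X]
r = correlation(X, Y)
print(r)
```

Here the positive and negative products of deviations cancel, so the mean product and therefore r are zero, while the variables are plainly not independent.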

Figs. 36, 37, 38 are drawn from the data of Tables II., III., and
IV., for which r has the values +0.91, +0.51, and +0.21 respectively,
the correlation being positive in each case. The student
should study such tables and diagrams closely, and endeavour to
accustom himself to estimating the value of r from the general
appearance of the table.

Fig. 38. Correlation between number of a Mother's Children and number of
her Daughter's Children (Table IV.): means of rows shown by circles
and means of columns by crosses: r = +0.21.

Fig. 39. Correlation between Population of a Registration District and
Proportion of Male Births per thousand of all births (England and Wales,
1881-90, Table VI.): means of rows shown by circles and means
of columns by crosses: r = -0.014.

13. The two quantities

$$b_1 = r\frac{\sigma_x}{\sigma_y}, \qquad b_2 = r\frac{\sigma_y}{\sigma_x}$$

are termed the coefficients of regression, or simply the regressions;
$b_1$ being the regression of x on y, or deviation in x corresponding
on the average to a unit change in the type of y, and $b_2$ being
similarly the regression of y on x. Whilst the coefficient of
correlation is always a pure number, the regressions are only
pure numbers if the two variables have the same dimensions, as
in Tables I.-IV.: their magnitudes depend on the ratio $\sigma_x/\sigma_y$, and
consequently on the units in which x and y are measured. They
are both necessarily of the same sign (the sign of r). Since r is
not greater than unity, one at least of the regressions must be
not greater than unity, but the other may be considerably greater
if the ratio $\sigma_x/\sigma_y$ or $\sigma_y/\sigma_x$ be great. The name regression arose
from the term being first introduced in the case of inheritance of
stature (Galton, refs. 2, 3). In this case the two standard deviations
are very nearly equal, so that both $b_1$ and $b_2$ are less than
unity, say (using the more recent data of Table III.) 0.50 and 0.52.
Hence the sons of fathers of deviation x from the mean of all fathers
have an average deviation of only 0.52x from the mean of all sons;
i.e. they step back or "regress" towards the general mean, and 0.52
may be termed the "ratio of regression." In general, however,
the idea of a "stepping back" or "regression" towards a more
or less stationary mean is quite inapplicable, obviously so where
the variables are different in kind, as in Tables V. and VI.,
and the term "coefficient of regression" should be regarded simply
as a convenient name for the coefficients $b_1$ and $b_2$. RR and CC
are generally termed the "lines of regression," and equations (6)
the "regression equations." The expressions "characteristic lines,"
"characteristic equations" (Yule, ref. 8) would perhaps be better.
Where the actual means of arrays appear to be given, to a satisfactory
degree of approximation, by straight lines, we may say
that the regression is linear. It is not safe, however, to assume
that such linearity extends beyond the limits of observation.
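
The dependence of the regressions on the units of measurement, and the independence of r, can be seen by expressing the same hypothetical earnings first in pence and then in shillings. The sketch below is illustrative; the function `regressions` and the data are assumptions made for the example, not figures from any table in the text.

```python
def regressions(X, Y):
    """Return (r, b1, b2): b1 is the regression of x on y, b2 of y on x."""
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(Y) / n
    p = sum((a - mean_x) * (b - mean_y) for a, b in zip(X, Y)) / n
    var_x = sum((a - mean_x) ** 2 for a in X) / n
    var_y = sum((b - mean_y) ** 2 for b in Y) / n
    return p / (var_x * var_y) ** 0.5, p / var_y, p / var_x


# Hypothetical data: earnings in pence and a percentage relieved
pence = [150, 162, 171, 180, 195, 204]
relief = [5.1, 4.0, 4.4, 3.2, 2.9, 2.5]
shillings = [v / 12 for v in pence]       # the same earnings in shillings

r_d, b1_d, b2_d = regressions(pence, relief)
r_s, b1_s, b2_s = regressions(shillings, relief)
print(r_d, r_s)        # the same pure number in both units
print(b1_d, b1_s)      # regression of x on y: twelve times smaller in shillings
print(b2_d, b2_s)      # regression of y on x: twelve times larger in shillings
```

In either set of units the product of the two regressions is $r^2$, which is why one at least of them cannot exceed unity in magnitude.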

14. The two standard deviations

$$s_x = \sigma_x\sqrt{1 - r^2}, \qquad s_y = \sigma_y\sqrt{1 - r^2}$$

are of considerable importance. It follows from (7) that $s_x$ is the
standard deviation of $(x - b_1 y)$, and similarly $s_y$ is the standard
deviation of $(y - b_2 x)$. Hence we may regard $s_x$ and $s_y$ as the
standard errors (root mean square errors) made in estimating x
from y and y from x by the respective characteristic relations

$$x = b_1 y, \qquad y = b_2 x.$$

$s_x$ may also be regarded as a kind of average standard deviation of
a row about RR, and $s_y$ as an average standard deviation of a
column about CC. In an ideal case, where the regression is
truly linear and the standard deviations of all parallel arrays are
equal, a case to which the distribution of Table III. is a rough
approximation, $s_x$ is the standard deviation of the x-array and $s_y$
the standard deviation of the y-array (cf. Chap. X. § 19 (3)).
Hence $s_x$ and $s_y$ are sometimes termed the "standard deviations
of arrays."
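
That $\sigma_x\sqrt{1 - r^2}$ is the root mean square error made in estimating x from y can be verified on any set of deviations. The following short Python check is a sketch under the assumption of hypothetical data; the variable names are introduced here for illustration.

```python
from math import sqrt

# Hypothetical deviations from the means (each list sums to zero)
x = [-2, -1, 0, 1, 2]
y = [-1, -2, 1, 0, 2]
N = len(x)

var_x = sum(a * a for a in x) / N
var_y = sum(c * c for c in y) / N
p = sum(a * c for a, c in zip(x, y)) / N
r = p / sqrt(var_x * var_y)
b1 = p / var_y                            # regression of x on y

# Root mean square of the errors made in estimating x as b1*y
rms_error = sqrt(sum((a - b1 * c) ** 2 for a, c in zip(x, y)) / N)
s_x = sqrt(var_x) * sqrt(1 - r * r)       # sigma_x * sqrt(1 - r^2)
print(rms_error, s_x)
```

The two printed values coincide, as equation (7) requires.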

15. Proceeding now to the arithmetical work, the only new
expression that has to be calculated in order to determine r, $b_1$, $b_2$,
$s_x$, and $s_y$ is the product sum $\Sigma(xy)$ or the mean product p. As in
the case of standard deviations, the form of the
arithmetic is slightly different according as the observations are
few and ungrouped, or sufficient to justify the formation of a 
correlation-table. In the first case, as in Example i. below, the 
work is quite straightforward. 

Table VII. Theory of Correlation: Example i. (Columns: 1, Union;
2, X, estimated average weekly earnings of agricultural labourers, in
shillings and pence; 3, Y, percentage of the population in receipt of
Poor-law relief; 4, deviation x, in pence; 5, deviation y; 6, x²; 7, y²;
8 and 9, products xy, positive and negative.)

Example i., Table VII. The variables are (1) X, the estimated
average weekly earnings of agricultural labourers in 38 English
Poor-law unions of an agricultural type (the data of Example i.,
Chap. VIII. p. 137). (2) Y, the percentage of the population
in receipt of Poor-law relief on the 1st January 1891 in each of the
same unions (B return). The means of each of the variables are
calculated in the ordinary way, and then the deviations x and y
from the mean are written down (columns 4 and 5); care must
be taken to give each deviation the correct sign. These deviations
are then squared (columns 6 and 7) and the standard deviations
found as before (Chap. VIII. p. 136). Finally, every x is
multiplied by the associated y and the product entered in column
8 or column 9 according to its sign. These columns are then
added up separately and the algebraic sum of the totals gives
$\Sigma(xy) = -666.04$: therefore the mean product $p = \Sigma(xy)/N = -17.53$, and

$$r = \frac{-17.53}{20.5 \times 1.29} = -0.66.$$

There is therefore a well-marked relation exhibited by these data
between the earnings of agricultural labourers in a district and
the percentage of the population in receipt of Poor-law relief.
A penny is rather a small unit in which to measure deviations in
the average earnings, so for the regressions we may alter the unit
of x to a shilling, making $\sigma_x = 1.71$, and

$$b_1 = r\frac{\sigma_x}{\sigma_y} = -0.87, \qquad b_2 = r\frac{\sigma_y}{\sigma_x} = -0.50.$$

The regression equations are therefore, in terms of these units,

$$x = -0.87\,y, \qquad y = -0.50\,x.$$

For practical purposes it is more convenient to express the
equations in terms of the absolute values of the variables rather
than the deviations: therefore, replacing x by (X - 15.94) and y
by (Y - 3.67) and simplifying, we have

$$X = 19.13 - 0.87\,Y \qquad (a)$$
$$Y = 11.64 - 0.50\,X \qquad (b)$$

the units being 1s. for the earnings and 1 per cent. for the
pauperism. The standard errors made in using these equations
to estimate earnings from pauperism and pauperism from earnings
respectively are

$$\sigma_x\sqrt{1 - r^2} = 15.4\text{d.} = 1.28\text{s.}$$

$$\sigma_y\sqrt{1 - r^2} = 0.97 \text{ per cent.}$$
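
The arithmetic of Example i. can be retraced from the summary figures quoted in the text (the product sum, the two standard deviations and the two means). The Python sketch below is an illustration and assumes, as the printed figures suggest, that r and the regressions are each carried to two decimal places before being used in the next step; the variable names are introduced here.

```python
from math import sqrt

N = 38
sum_xy = -666.04                # product sum, x in pence, y in per cent.
sigma_x_pence = 20.5
sigma_x = 1.71                  # the same standard deviation in shillings
sigma_y = 1.29
mean_X, mean_Y = 15.94, 3.67    # shillings and per cent.

p = sum_xy / N                                   # mean product
r = round(p / (sigma_x_pence * sigma_y), 2)      # carried to two places

b1 = round(r * sigma_x / sigma_y, 2)             # regression of x on y
b2 = round(r * sigma_y / sigma_x, 2)             # regression of y on x
const_a = mean_X - b1 * mean_Y                   # constant term of equation (a)
const_b = mean_Y - b2 * mean_X                   # constant term of equation (b)

se_x_pence = sigma_x_pence * sqrt(1 - r * r)     # standard error, in pence
se_y = sigma_y * sqrt(1 - r * r)                 # standard error, in per cent.

print(round(p, 2), r, b1, b2)
print(round(const_a, 2), round(const_b, 2))
print(round(se_x_pence, 1), round(se_x_pence / 12, 2), round(se_y, 2))
```

Rounded in this way, the results agree with the values printed in the text.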



The equation (b) tells us therefore that a rise of 2s. in earnings
in passing from one district to another means on the average a
fall of 1 in the percentage in receipt of relief. A natural conclusion
would be that this means a direct effect of the higher
earnings in diminishing the necessity for relief, but such a
conclusion cannot be accepted offhand. Equation (a) indicates,
for instance, that every rise of a unit in the percentage relieved
corresponds to a fall of 0.87 shillings, or 10½d., in earnings:
this might mean that the giving of relief tends to depress wages.
Which is the correct interpretation of the facts? The above
regression equations alone cannot tell us this, and it is in the
discussion of such questions that most of the difficulties of statistical
arguments arise.

Fig. 40. Correlation between Pauperism and Average Earnings of Agricultural
Labourers for certain districts ...: RR, CC, lines of regression.

As a check on the whole of the arithmetical work, and to test
whether the correlation coefficient is unduly affected by a few outlying
observations, or, perhaps, by the regression not being linear,
it is always as well to draw a diagram representing the results
obtained. Take scales along two axes at right angles (fig. 40)
representing the variables, and insert a dot (better, for clearness,
a small circle or a cross) at the point determined by each observed
pair of X and Y. Complete the diagram by inserting the two lines
RR and CC given by the regression equations (a) and (b). In
doing this it is as well to determine a point at each end of both
lines, and then to check the work by seeing that they meet in the
mean of the whole distribution. Thus RR is determined from (a)
by the points Y = 0, X = 19.13 and Y = 6, X = 13.91: CC is
determined from (b) by the points X = 12, Y = 5.64 and X = 21,
Y = 1.14. Marking in these points, and drawing the lines, they
will be found to meet in the mean, X = 15.94, Y = 3.67. The
diagram gives a very clear idea of the distribution; clearly the
regression is as nearly linear as may be with so very scattered a
distribution, and there are no very exceptional observations. The
most exceptional districts are Brixworth and St Neots with rather
low earnings but very low pauperism, and Glendale and Wigton
with the highest earnings but a pauperism well above the lowest,
over 3 per cent.
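
The check that the two lines of regression meet in the mean can also be made by calculation. The sketch below solves equations (a) and (b) simultaneously; the function name `intersection` is an assumption made for the example, and small differences from the printed mean arise only because the coefficients are rounded to two places.

```python
def intersection(a0, a1, c0, c1):
    """Point where the lines X = a0 + a1*Y and Y = c0 + c1*X meet."""
    X = (a0 + a1 * c0) / (1 - a1 * c1)
    Y = c0 + c1 * X
    return X, Y


# Equations (a) and (b) of Example i.
X_meet, Y_meet = intersection(19.13, -0.87, 11.64, -0.50)
print(round(X_meet, 2), round(Y_meet, 2))

# Points used for drawing CC from equation (b)
cc_points = [(X, round(11.64 - 0.50 * X, 2)) for X in (12, 21)]
print(cc_points)
```

The point of meeting lies within a hundredth of a unit of the mean X = 15.94, Y = 3.67 given in the text.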

16. When a classified correlation-table is to be dealt with, the
procedure is of precisely the same kind as was used in the calculation
of a standard deviation, the same artifices being used to shorten
the work. That is to say, (1) the product-sum is calculated in the
first instance with respect to an arbitrary origin, and is afterwards
reduced to the value it would have with respect to the mean; (2)
the arbitrary origin is taken at the centre of a class-interval; (3)
the class-interval is treated as the unit of measurement throughout
the arithmetic.

Let deviations from the arbitrary origin be denoted by $\xi$, $\eta$, and
let $\bar\xi$, $\bar\eta$ be the co-ordinates of the mean. Then

$$\xi\eta = xy + \bar\xi\,y + \bar\eta\,x + \bar\xi\bar\eta .$$

Therefore, summing, since the second and third sums on the
right vanish, being the sums of deviations from the mean,

$$\Sigma(\xi\eta) = \Sigma(xy) + N\bar\xi\bar\eta ,$$

or bringing $\Sigma(xy)$ to the left,

$$\Sigma(xy) = \Sigma(\xi\eta) - N\bar\xi\bar\eta .$$

That is, in terms of mean-products, using $p'$ to denote the mean-product
for the arbitrary origin,

$$p = p' - \bar\xi\bar\eta .$$

In any case where the origin from which deviations have been
measured is not the mean, this correction must be used. It will
sometimes give a sensible correction even for work in the form of
Example i., and in that case, of course, the standard deviations
will also require reduction to the mean.
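
The three artifices just described can be put together in a short routine. The Python sketch below is illustrative only: the function name `grouped_r` and the small frequency table are hypothetical, deviations are measured in class-intervals from an arbitrary origin placed at a class centre, and the correction $p = p' - \bar\xi\bar\eta$ is applied together with the corresponding reduction of the standard deviations to the mean.

```python
from math import sqrt


def grouped_r(table, origin_row, origin_col):
    """Correlation coefficient from a grouped table.

    table[i][j] is the frequency in row i and column j; deviations are
    counted in class-intervals from the compartment (origin_row, origin_col).
    """
    N = sum(sum(row) for row in table)
    s_xi = s_eta = s_xi2 = s_eta2 = s_prod = 0.0
    for i, row in enumerate(table):
        eta = i - origin_row
        for j, f in enumerate(row):
            xi = j - origin_col
            s_xi += f * xi
            s_eta += f * eta
            s_xi2 += f * xi * xi
            s_eta2 += f * eta * eta
            s_prod += f * xi * eta
    xi_bar, eta_bar = s_xi / N, s_eta / N
    p = s_prod / N - xi_bar * eta_bar       # p = p' - xi_bar * eta_bar
    var_x = s_xi2 / N - xi_bar ** 2         # reduced to the mean
    var_y = s_eta2 / N - eta_bar ** 2
    return p / sqrt(var_x * var_y)


# Hypothetical grouped table of 14 observations
table = [[3, 1, 0],
         [1, 4, 1],
         [0, 1, 3]]
r_centre = grouped_r(table, 1, 1)   # origin at the central compartment
r_corner = grouped_r(table, 0, 0)   # origin at a corner compartment
print(r_centre, r_corner)
```

Whatever origin is chosen, the corrected value of r is the same, which is the point of the reduction.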

As the arithmetical process of calculating the correlation coefficient
from a grouped table is of great importance, we give two
illustrations, the first economic, the second biological.

Example ii., Table VIII. The two variables are (1) X, the
percentage of males over 65 years of age in receipt of Poor-law
relief in 235 unions of a mainly rural character in England and
Wales; (2) Y, the ratio of the numbers of persons given relief
"outdoors" (in their own homes) to one "indoors" (in the workhouse).
The figures refer to a one-day count (1st August 1890, No. 36,
1890), and the table is one of a series that were drawn up with
the view to discussing the influence of administrative methods on
pauperism. (Economic Journal, vol. vi., 1896, p. 613.)

The arbitrary origin for X was taken at the centre of the fourth
column, or at 17.5 per cent.; for Y at the centre of the fourth
row, or 3.5. The following are the values found for the constants
of the single distributions:

$\bar\xi = -0.1532$ intervals $= -0.77$ per cent., whence $M_1 = 16.73$ per cent.
$\sigma_x = 1.29$ intervals $= 6.45$ per cent.
$\bar\eta = +0.36$ intervals or units, whence $M_2 = 3.86$.
$\sigma_y = 2.98$ units.

Table VIII. Theory of Correlation: Example ii. Old-age Pauperism and
Proportion of Out-relief. (The frequencies are the figures printed in
ordinary type. The numbers in heavy type are the deviation-products
$\xi\eta$. Rows: numbers relieved outdoors to one indoors, in unit classes;
columns: percentage in receipt of relief, in classes of 5 per cent.)

Table VIIIa. Calculation of the product sum $\Sigma(\xi\eta)$. (Columns: 1, $\xi\eta$;
2 and 3, frequencies in the positive and negative quadrants; 4, total;
5 and 6, products, positive and negative.) Totals: column 2, 100.5;
column 3, 41.5; frequency for which $\xi\eta = 0$, 93; in all, 235.
Products: +334 and -44; algebraic sum, +290.

To calculate $\Sigma(\xi\eta)$, the value of $\xi\eta$ is first written in every
compartment of the table against the corresponding frequency,
treating the class-interval as the unit: these are the figures in
heavy type in Table VIII. In making these entries the sign of
the product may be neglected, but it must be remembered that
this sign will be positive in the upper left-hand and lower right-hand
quadrants, negative in the two others. The frequencies are
then collected as shown in columns 2 and 3 of Table VIIIa.,
being grouped according to the value and sign of $\xi\eta$. Thus for
$\xi\eta = 1$, the total frequency in the positive quadrants is 13 + 8.5
= 21.5, in the negative 14 + 6 = 20; for $\xi\eta = 2$, 10 + 4.5 + 1 + 4.5
= 20 in the positive quadrants, 5 + 2 + 1 + 3.5 = 11.5 in the
negative, and so on. When columns 2 and 3 are completed, they
should first of all be checked to see that no frequency has been
dropped, which may be readily done by adding together the totals
of these two columns together with the frequency in row 4 and
column 4 of Table VIII. (the row and column for which $\xi\eta = 0$),
being careful not to count twice the frequency in the compartment
common to the two; this grand total must clearly be equal to the
total number of observations N, or 235 in the present case. The
algebraic sum of the frequencies in each line of columns 2 and 3 is
then entered in column 4, treating the frequencies in column 3 as if
they were themselves negative, and finally the figures of column 4
are multiplied by the values of $\xi\eta$ and the products entered in
column 5 or 6 according to sign. The algebraic sum of the totals
of columns 5 and 6 = +290 = $\Sigma(\xi\eta)$. Whence $p' = \Sigma(\xi\eta)/N = 1.234$.
To find the value of p we have, remembering that we are working
with class-intervals as the unit,

$$\bar\xi\bar\eta = -(0.153 \times 0.36) = -0.055$$

$$p = p' - \bar\xi\bar\eta = 1.234 + 0.055 = +1.289$$

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{1.289}{1.29 \times 2.98} = +0.34$$

The regression of pauperism on out-relief ratio is, reverting to
1 per cent. as the unit of pauperism instead of the class-interval,
+0.34 × 6.45/2.98 = 0.74, and the regression equation accordingly
x = 0.74y, or

$$X = 13.9 + 0.74\,Y,$$

the standard error made in using the equation for estimating X
from Y being $\sigma_x\sqrt{1 - r^2} = 6.07$.
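
The figures of Example ii. may be retraced from the totals and constants quoted in the text. The following Python sketch is an illustration only and assumes, as the printed values suggest, that r and the regression are each rounded to two decimal places before further use; the variable names are introduced here.

```python
from math import sqrt

N = 235
sum_prod = 290                      # product sum about the arbitrary origin
xi_bar, eta_bar = -0.153, 0.36      # co-ordinates of the mean, in intervals
sigma_x_int, sigma_y = 1.29, 2.98   # standard deviations, in intervals
sigma_x = 6.45                      # standard deviation of X in per cent.
M1, M2 = 16.73, 3.86                # means of X and Y

p_dash = sum_prod / N                            # mean product, arbitrary origin
p = p_dash - xi_bar * eta_bar                    # reduced to the mean
r = round(p / (sigma_x_int * sigma_y), 2)

b1 = round(r * sigma_x / sigma_y, 2)             # regression of X on Y
constant = M1 - b1 * M2                          # constant of the equation
std_error = sigma_x * sqrt(1 - r * r)

print(round(p_dash, 3), round(p, 3), r)
print(b1, round(constant, 1), round(std_error, 2))
```

The results reproduce the regression equation and the standard error stated above.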

This is the equation of greatest practical interest, telling us 
that, as we pass from one district to another, a rise of 1 in the 
ratio of the numbers relieved in their own homes to the numbers 
relieved in the workhouse corresponds on an average to a rise of 
0.74 in the percentage in receipt of relief. The result is such as
to create a presumption in favour of the view that the giving of
out-relief tends to increase the numbers relieved, and this can be
taken as a working hypothesis for further investigation.

The student should work out the second regression equation,
and check both by calculating the means of the principal rows
and columns, and drawing a diagram like figs. 36, 37, and 38.

Example iii., Table IX. (Unpublished data; measurements by
G. U. Yule.) The two variables are (1) X, the length of a mother-frond
of Lemna minor; (2) Y, the length of the daughter-frond.
The mother-frond was measured when the daughter-frond
separated from it, and the daughter-frond when its first daughter-frond
separated. Measures were taken from camera drawings
made with the Zeiss-Abbe camera under a low power, the actual
magnification being 24 : 1. The units of length in the tabulated
measurements are millimetres on the drawing.

The arbitrary origin for both X and Y was taken at 105 mm. 
The following are the values found for the constants of the single 
distributions ; — 



1^- 1-068 intorTal8 = 


6-3 mn 


.V,= 98-7 mm. on drawing 


<r.= 2-828:intor™lB= 


17-0 mm 


on drawing= 0-707 mm. actual. 


5= -0-203 „ = 


1-2 mm 


Af, = 103-8 mm. on drawing 
= 4-82 mm. actual. 



a,= 8-08* „ - 18-B mm. on drawings 0-771 mm. aotuat. 

The values of ^ are entered in every compartment of the 
table as before, and the frequencies then collected, according to 
the magnitude and sign of ^, in columns 2 and 3 of Table IXa. 
The entries in these two columns are nest checked by adding to 
the totals the frequency in the row and column for which ^ is 
zero, and seeing that it gives the total number of observations 
(266). The numbers in column 4 are given by deducting the 
entries in column 3 from those in column 2. The totals so 
obtained are multiplied by ^^ (column 1) and the products entered 



THKORY OF STATISTIC6, 



I. 


2. 


3. 


4. 


5. 6. 




Frequ 


ndes. 




Products. 


t1- 






Total. 




+ 










Quiidrant*. 


QouIrontB. 




■" ■ 




1 


_ 


8 '6 


- 8'5 




8-5 


2 


17 


13'6 


+ 3-5 


7 




8 


10-6 


9 


+ 1'6 


4-6 




4 


136 


8-5 




28 




5 


2 


0-6 


+ 1-6 


7-6 




6 


18-6 


6 


+ 8-6 


61 




8 


13 




+ 12 


98 




e 


S 


4 


+ 6 


46 




10 


8-5 




+ 6-6 


S6 




13 


17 -B 




+ I7-B 


210 




14 


1 






14 




16 


e 




+ 8 


90 




16 


7 




+ 7 


112 




18 


2 




+ 2 


38 




20 


8 






IflO 




21 


2 






42 




24 


6 






144 




26 






+ 1 


26 




28 


1 




+ 1 


28 




30 


8 




+ 3 


90 




36 


1 




+ 1 


36 




40 






+ 1 


40 




4S 


2 


- 


+ 2 


84 




60 


1 




+ 1 


60 




68 


1 


- 


+ 1 


63 


- 


TolalB 


14B-6 
40 
71-5 


49 


- 


+ 1628 -a-6 
- 8-6 


1619-5 








2B9 









in column 5 or 6 according to sign. The algebraic sum of the
totals of these two columns gives Σ(ξη) = +1519.5. Dividing
by 266, p′ = 5.712. But ξ̄η̄ = +1.058 × 0.203 = +0.215; therefore
p = 5.712 - 0.215 = 5.497. Hence

    r = 5.497/(2.828 × 3.084) = +0.63.




The regression of daughter-frond on mother-frond is 0.69 (a
value which will not be altered by altering the units of measure-
ment for both mother- and daughter-fronds, as such an alteration
will affect both standard deviations equally). Hence the re-
gression equation giving the average actual length (in millimetres)
of daughter-fronds for mother-fronds of actual length X is

    Y = 1.48 + 0.69X

We again leave it to the student to work out the second
regression equation giving the average length of mother-fronds
for daughter-fronds of length Y, and to check the whole work
by a diagram showing the lines of regression and the means of
arrays for the central portion of the table.
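
The arithmetic of this example can be retraced in a few lines. The following is only a sketch: the constants are those quoted in the worked example (all in class-interval units), while the variable names are assumptions introduced for illustration.

```python
# Constants as read from the worked example, in class-interval units.
N = 266                  # number of observations
sum_products = 1519.5    # algebraic sum of the product columns, sum of xi*eta
mean_xi = -1.058         # mean of X about the arbitrary origin
mean_eta = -0.203        # mean of Y about the arbitrary origin
sd_x = 2.828             # standard deviation of X
sd_y = 3.084             # standard deviation of Y

# Mean product about the arbitrary origin.
p_dash = sum_products / N
# Correct to the mean product about the true means.
p = p_dash - mean_xi * mean_eta
# Correlation coefficient and regression of Y on X.
r = p / (sd_x * sd_y)
b_yx = r * sd_y / sd_x

print(f"p' = {p_dash:.3f}, p = {p:.3f}, r = {r:+.2f}, b = {b_yx:.2f}")
```

The regression coefficient is computed here from the standard deviations in class-intervals; since both variables share the same interval and the same scale of drawing, the ratio of the standard deviations, and hence the coefficient, is the same in millimetres.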

17. The student should be careful to remember the following
points in working:

(1) To give p′ and ξ̄η̄ their correct signs in finding the true
mean deviation-product p.

(2) To express σ_x and σ_y in terms of the class-interval as a
unit, in the value of r = p/σ_x σ_y, for these are the units in terms
of which p has been calculated.

(3) To use the proper units for the standard deviations (not
class-intervals in general) in calculating the coefficients of
regression: in forming the regression equation in terms of the
absolute values of the variables, for example, as above, the work
will be wrong unless means and standard deviations are ex-
pressed in the same units.

Further, it must always be remembered that correlation
coefficients, like all other statistical measures, are subject to
fluctuations of sampling (cf. Chap. III. §§ 7, 8). If we write
on cards a series of pairs of strictly independent values of x and
y and then work out the correlation coefficient for samples of,
say, 40 or 50 cards taken at random, we are very unlikely ever
to find r = 0 absolutely, but will find a series of positive and
negative values centring round 0. No great stress can therefore
be laid on small, or even on moderately large, values of r as
indicating a true correlation if the numbers of observations be
small. For instance, if N = 36, a value of r = ±0.5 may be
merely a chance result (though a very infrequent one); if
N = 100, r = ±0.3 may similarly be a mere fluctuation of
sampling, though again an infrequent one. If N = 900, a value
of r = ±0.1 might occur as a fluctuation of sampling of the same
degree of infrequency. The student must therefore be careful in
interpreting his coefficients. (See Chap. XVII. § 15.)
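
The card experiment described above is easily imitated by simulation. The sketch below is an illustration only: the use of normally distributed values, the sample size of 40, the number of samples and the seed are all assumptions, not part of the text.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation of two equally long lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_cards, n_samples, seed=0):
    """Draw repeated samples of strictly independent pairs and return their r's."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_cards)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n_cards)]
        results.append(pearson_r(xs, ys))
    return results


rs = simulate(n_cards=40, n_samples=2000)
mean_r = sum(rs) / len(rs)
sd_r = math.sqrt(sum((r - mean_r) ** 2 for r in rs) / len(rs))
share_large = sum(abs(r) >= 0.3 for r in rs) / len(rs)

print(f"mean r = {mean_r:+.3f}, s.d. of r = {sd_r:.3f}")
print(f"share of samples with |r| >= 0.3: {share_large:.3f}")
```

The values of r scatter on both sides of zero although the variables are independent, which is the point of the caution above.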

Finally, it should be borne in mind that any coefficient, e.g. the
coefficient of correlation or the coefficient of contingency, gives
only a part of the information afforded by the original data or
the correlation table. The correlation table itself, or the original
data if no correlation table has been compiled, should always be
given, unless considerations of space or of expense absolutely
preclude the adoption of such a course.

REFERENCES. 

The theory of correlation was first developed on definite assumptions
as to the form of the distribution of frequency, the so-called "normal
distribution" (Chap. XVI.) being assumed. In (1) Bravais introduced
the product-sum, but not a single symbol for a coefficient of correlation.
Sir Francis Galton, in (2), (3), and (4), developed the practical method,
determining his coefficient (Galton's function, as it was termed at first)
graphically. Edgeworth developed the theoretical side further in (5),
and Pearson introduced the product-sum formula in (6), both memoirs
being written on the assumption of a "normal" distribution of fre-
quency (cf. Chap. XVI.). The method used in the preceding chapter
is based on (7) and (8).

(1) Bravais, A., "Analyse mathématique sur les probabilités des erreurs de
situation d'un point," Acad. des Sciences: Mémoires présentés par divers
savants, II. série, t. ix., 1846, p. 255.

(2) Galton, Francis, "Regression towards Mediocrity in Hereditary
Stature," Jour. Anthrop. Inst., vol. xv., 1886, p. 246.

(3) Galton, Francis, "Family Likeness in Stature," Proc. Roy. Soc.,
vol. xl., 1886, p. 42.

(4) Galton, Francis, "Correlations and their Measurement," Proc. Roy.
Soc., vol. xlv., 1888, p. 135.

(5) Edgeworth, F. Y., "On Correlated Averages," Phil. Mag., 5th Series,
vol. xxxiv., 1893, p. 190.

(6) Pearson, Karl, "Regression, Heredity, and Panmixia," Phil. Trans.
Roy. Soc., Series A, vol. clxxxvii., 1896, p. 253.

(7) Yule, G. U., "On the Significance of Bravais' Formulae for Regression,
etc., in the case of Skew Correlation," Proc. Roy. Soc., vol. lx., 1897,
p. 477.

(8) Yule, G. U., "On the Theory of Correlation," Jour. Roy. Stat. Soc.,
vol. lx., 1897, p. 812.

(9) Darbishire, A. D., "Some Tables for illustrating Statistical Correla-
tion," Mem. and Proc. of the Manchester Lit. and Phil. Soc., vol. li.,
1907. (Tables and diagrams illustrating the meaning of values of the
correlation coefficient from 0 to 1 by steps of a twelfth.)

Reference may also be made here to:

(10) Edgeworth, F. Y., "On a New Method of reducing Observations
relating to several Quantities," Phil. Mag., 5th Series, vol. xxiv., 1887,
p. 222, and vol. xxv., 1888, p. 184. (A method of treating correlated
variables differing entirely from that described in the preceding
chapter, and based on the use of the median: the method involves
the use of trial and error to some extent. For some illustrations see
F. Y. Edgeworth and A. L. Bowley, Jour. Roy. Stat. Soc., vol. lxv.,
1902, p. 341 et seq.)

References to memoirs on the theory of non-linear regression are given
at the end of Chapter X.






B of legieasion for the 



[As a matter of practice it is never worth calculating a correlation-coefficient
for so few observations: the figures are given solely as a short example on
which the student can test his knowledge of the work.]

2. The following figures show, for the districts of Example 1, the ratios of
the numbers of paupers in receipt of outdoor relief to the numbers in receipt
of relief in the workhouse. Find the correlations between the out-relief ratio
and (1) the estimated earnings of agricultural labourers; (2) the percentage
of the population in receipt of relief.



6-40 


H 


7 -60 


27 


2-(7 


1-Oi 


IG 


fU 


28 


G'8S 


7-90 


IS 


S'31 


29 


8-24 


3-81 




0-09 


80 




7-86 


18 


»-89 


81 


6-87 


0-*5 


19 


4-00 


sa 


6-M 


10-00 


20 


8-02 


88 


S-B8 


«-43 


21 


8-27 


34 


fl-»!t 


4T8 


23 


1-B8 


SB 


e-02 


4-78 


28 


16-04 


80 


4-02 



3. Verify the following data for the under-mentioned tables of the preceding
chapter. Calculate the means of rows and columns and draw diagrams showing
the lines of regression, as figs. 36-38, for one or two cases at least.





                               I.          II.          III.         IV.      VI.

Mean of X                     55.3 mm.    40.6 years   67.70 ins.    5.90    509.2
Mean of Y                     53.1 mm.    42.8 years   68.66 ins.    4.33    14,500
Standard deviation of X                   12.7 years    2.72 ins.             7.48
Standard deviation of Y       6.77 mm.    13.1 years    2.75 ins.    2.97    18,100
Coefficient of correlation    +0.87       +0.91        +0.51        +0.21    -0.014
Coefficient of contingency
  (for the grouping stated
  below)                       0.90        0.81         0.51         0.31     0.47

In calculating the coefficient of contingency (coefficient of mean square
contingency) use the following groupings, so as to avoid small scattered fre-
quencies at the extremities of the tables and also excessive arithmetic:

I. Group together (1) two top rows, (2) three bottom rows, (3) two first
columns, (4) four last columns, leaving centre of table as it stands.

II. Regroup by ten-year intervals (15-, 25-, 35-, etc.) for both husband and
wife, making the last group "85 and over."

III. Regroup by 2-inch intervals, 58.5-60.5, etc., for father, 59.5-61.5,
etc., for son. If a 3-inch grouping be used (58.5-61.5, etc., for both father and
son), the coefficient of mean square contingency is 0.135. (Both results cited
from Pearson, ref. 1 of Chap. V.)

IV. For cols., group 1+2, 3+4, . . . , 11+12, 13 and upwards. Rows,
0, 1+2, 3+4, . . . , 9+10, 11 and upwards.

VI. For cols., group all up to 494.5 and all over 521.5, leaving central cols.
Rows singly up to 20: then 20-28, 28-44, 44-68, 68 upwards.






CORRELATION: PRACTICAL APPLICATIONS AND
METHODS.

1. Necessity for careful choice of variables before proceeding to calculate r.
2-8. Illustration i.: Causation of pauperism. 9-10. Illustration
ii.: Inheritance of fertility. 11-13. Illustration iii.: The weather
and the crops. 14. Correlation between the movements of two
variables: (a) Non-periodic movements: Illustration iv.: Changes
in infantile and general mortality. 15-17. (b) Quasi-periodic move-
ments: Illustration v.: The marriage-rate and foreign trade.
18. Elementary methods of dealing with cases of non-linear regression.
19. Certain rough methods of approximating to the correlation-
coefficient.

1. The student, and especially the student of economic statistics, to
whom this chapter is principally addressed, should be careful to
note that the coefficient of correlation, like an average or a
measure of dispersion, only exhibits in a summary and compre-
hensible form one particular aspect of the facts on which it is
based, and the real difficulties arise in the interpretation of the
coefficient when obtained. The value of the coefficient may be
consistent with some given hypothesis, but it may be equally
consistent with others; and not only are care and judgment
essential for the discussion of such possible hypotheses, but also
a thorough knowledge of the facts in all other possible aspects.
Further, care should be exercised from the commencement in the
selection of the variables between which the correlation shall be
determined. The variables should be defined in such a way as
to render the correlations as readily interpretable as possible,
and, if several are to be dealt with, they should afford the answers
to specific and definite questions. Unfortunately, the field of
choice is frequently very much limited, by deficiencies in the
available data and so forth, and consequently practical possibilities
as well as ideal requirements have to be taken into account. No
general rules can be laid down, but the following are given as
illustrations of the sort of points that have to be considered.



2. Illustration i. It is required to throw some light on the
variations of pauperism in the unions (unions of parishes) of
England. (Cf. Yule, ref. 2.)

One table (Table VIII.) bearing on a part of this question, viz.
the influence of the giving of out-relief on the proportion of the
aged in receipt of relief, was given in Chap. IX. (p. 183). The
question was treated by correlating the percentage of the aged
relieved in different districts with the ratio of numbers relieved
outdoors to the numbers in the workhouse. Is such a method
the best possible?

On the whole, it would seem better to correlate changes in
pauperism with changes in various possible factors. If we say
that a high rate of pauperism in some district is due to lax
administration, we presumably mean that as administration
became lax, pauperism rose, or that if administration were more
strict, pauperism would decrease; if we say that the high pauper-
ism is due to the depressed condition of industry, we mean that
when industry recovers, pauperism will fall. When we say, in
fact, that any one variable is a factor of pauperism, we mean
that changes in that variable are accompanied by changes in the
percentage of the population in receipt of relief, either in the
same or the reverse direction. It will be better, therefore, to
deal with changes in pauperism and possible factors. The next
question is what factors to choose.

3. The possible factors may be grouped under three heads:

(a) Administration. Changes in the method or strictness of
administration of the law.

(b) Environment. Changes in economic conditions (wages,
prices, employment), social conditions (residential or industrial
character of the district, density of population, nationality of
population), or moral conditions (as illustrated, e.g., by the statis-
tics of crime).

(c) Age Distribution. The percentage of the population between
given age-limits in receipt of relief increases very rapidly with old
age, the actual figures given by one of the only two then existing
returns of the age of paupers being: 2 per cent. under age 16,
1 per cent. over 16 but under 65, 20 per cent. over 65. (Return
36, 1890.)

It is practically impossible to deal with more than three factors,
one from each of the above groups, or four variables alto-
gether, including the pauperism itself. What shall we take, then,
as representative variables, and how shall we best measure
"pauperism"?

4. Pauperism. The returns give (a) cost, (b) numbers relieved.
It seems better to deal with (b) (as in the illustration of Table
VIII., Chap. IX.), as numbers are more important than cost from
the standpoint of the moral effect of relief on the population.
The returns, however, generally include both lunatics and vagrants
in the totals of persons relieved; and as the administrative methods
of dealing with these two classes differ entirely from the methods
applicable to ordinary pauperism, it seems better to alter the
official total by excluding them. Returns are available giving
the numbers in receipt of relief on 1st January and 1st July;
there does not seem to be any special reason for taking the one
return rather than the other, but the return for 1st January was
actually used. The percentage of the population in receipt of
relief on 1st January 1871, 1881, and 1891 (the three census
years), less lunatics and vagrants, was therefore tabulated for each
union. (The investigation was carried out in 1898.)

5. Administration. The most important point here, and one
that lends itself readily to statistical treatment, is the relative
proportion of indoor and outdoor relief (relief in the workhouse
and relief in the applicant's home). The first question is,
again, shall we measure this proportion by cost or by numbers?
The latter seems, as before, the simpler and more important ratio
for the present purpose, though some writers have preferred the
statement in terms of expenditure (e.g. Mr Charles Booth, Aged
Poor: Condition, 1894). If we decide on the statement in terms
of numbers, we still have the choice of expressing the proportion (1)
as the ratio of numbers given out-relief to numbers in the work-
house, or (2) as the percentage of numbers given out-relief on
the total number relieved. The former method was chosen,
partly on the simple ground that it had already been used in an
earlier investigation, partly on the ground that the use of the
ratio separates the higher proportions of out-relief more clearly
from each other, and these differences seem to have significance.
Thus a union with a ratio of 15 outdoor paupers to one indoor
seems to be materially different from one with a ratio of, say, 10
to 1; but if we take, instead of the ratios, the percentages of
outdoor to total paupers, the figures are 94 per cent. and 91 per
cent. respectively, which are so close that they will probably fall
into the same array. The ratio of numbers in receipt of outdoor
relief to the numbers in the workhouse, in every union, was
therefore tabulated for 1st January in the census years 1871, 1881,
1891.
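
The compression of the higher proportions under the percentage form can be checked directly. In the minimal sketch below the function name is an assumption introduced for illustration; the ratios 10 and 15 are those of the text, and 1 and 2 are added only for contrast.

```python
def percent_outdoor(ratio):
    """Percentage of outdoor paupers on the total, given the outdoor-to-indoor ratio."""
    return 100.0 * ratio / (ratio + 1.0)


for ratio in (1, 2, 10, 15):
    print(f"ratio {ratio:>2} to 1 -> {percent_outdoor(ratio):.1f} per cent. outdoor")
```

A change of the ratio from 10 to 15 moves the percentage by only about three points, whereas a change from 1 to 2 moves it by nearly seventeen.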

6. Environment. This is the most difficult factor of all to deal
with. In Mr Booth's work the factors tabulated were (1) persons
per acre; (2) percentage of population living two or more to a
room, i.e. "overcrowding"; (3) rateable value per head (Aged Poor:
Condition). The data relating to overcrowding were first collected
at the census of 1891, and are not available for earlier years.
Some trial was made of rateable value per head, but with not
very satisfactory results. For any given year, and for a group of
unions of somewhat similar character, e.g. rural, the rateable value
per head appears to be highly (negatively) correlated with the
pauperism, but changes in the two are not very highly correlated:
probably the movements of assessments are sluggish and irregular,
especially in the case of falling assessments in rural unions, and
do not correspond at all accurately with the real changes in the
value of agricultural land. After some consideration, it was
decided to use a very simple index to the changing fortunes of a
district, viz. the movement of the population itself. If the
population of a district is increasing at a rate above the average,
this is primâ facie evidence that its industries are prospering; if
the population is decreasing, or not increasing as fast as the
average, this strongly suggests that the industries are suffering
from a temporary lack of prosperity or permanent decay. The
population of every union was therefore tabulated for the censuses
of 1871, 1881, 1891.

7. Age Distribution. As already stated, the figures that are
known clearly indicate a very rapid rise of the percentage relieved
after 65 years of age. The percentage of the population over 65
years of age was therefore worked out for every union and tabu-
lated from the same three censuses. This is not, of course,
at all a complete index to the composition of the population as
affecting the rate of pauperism, which is sensibly dependent on
the proportion of the two sexes, and the numbers of children as
well. As the percentage in receipt of relief was, however, 20 per
cent. for those over 65, and only 1 to 2 per cent. for those under that
age, it is evidently a most important index. (A more complete
method might have been used by correcting the observed rate of
pauperism to the basis of a standard population with given num-
bers of each age and sex. Cf. below, Chap. XI. pp. 219-21.)

8. The changes in each of the four quantities that had been
tabulated for every union were then measured by working out the
ratios for the intercensal decades 1871-81 and 1881-91, taking
the value in the earlier year as 100 in each case. The percentage
ratios so obtained were taken as the four variables. Further, as
the conditions are and were very different for rural and for urban
unions, it seemed very desirable to separate the unions into groups
according to their character. But this cannot be done with any
exactness: the majority of unions are of a mixed character, con-
sisting, say, of a small town with a considerable extent of the
surrounding country. It might seem best to base the classification
on returns of occupations, e.g. the proportions of the population
engaged in agriculture, but the statistics of occupations are not
given in the census for individual unions. Finally, it was decided
to use a classification by density of population, the grouping used
being: Rural, 0.3 person per acre or less; Mixed, more than
0.3 but not more than 1 person per acre; Urban, more than 1 person
per acre. The metropolitan unions were also treated by them-
selves. The limit 0.3 for rural unions was suggested by the
density of those agricultural unions the conditions in which
were investigated by the Labour Commission (the unions of
Table VII., Chap. IX.): the average density of these was 0.26,
and 34 of the 38 were under 0.3. The lower limit of density for
urban unions, 1 per acre, was suggested by a grouping of Mr
Booth's (group xiv.): of course 1 person per acre is not a density
associated with an urban district in the ordinary sense of the
term, but a country district cannot reach this density unless it
include a small town or portion of a town, i.e. unless a large
proportion of its inhabitants live under urban conditions.

The method by which the relations between four variables are
discussed is fully described in Chapter XII.; at the present stage
it can only be stated that the discussion is based on the correlations
between all the possible (6) pairs that can be formed from the four
variables.
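
The preparation of the variables described in this section may be sketched as follows. The function names and the short labels of the four variables are assumptions made for illustration, and the pauperism figures in the last lines are hypothetical, not taken from the returns.

```python
from itertools import combinations


def percentage_ratio(earlier, later):
    """Value at the later census, the value at the earlier census being taken as 100."""
    return 100.0 * later / earlier


def density_class(persons_per_acre):
    """Grouping by density of population as stated in the text."""
    if persons_per_acre <= 0.3:
        return "rural"
    if persons_per_acre <= 1.0:
        return "mixed"
    return "urban"


# The four variables, each a percentage ratio for one intercensal decade.
VARIABLES = ("pauperism", "out-relief ratio", "population", "proportion over 65")

# Every pair of variables between which a correlation has to be worked out.
pairs = list(combinations(VARIABLES, 2))

# Hypothetical union: 4.0 per cent. relieved at the earlier census, 3.0 at the later.
print(percentage_ratio(4.0, 3.0), density_class(0.26), len(pairs))
```

The metropolitan unions, which were treated by themselves, are not covered by the density rule and would have to be set apart beforehand.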

9. Illustration ii. The subject of investigation is the inheritance
of fertility in man. (Cf. Pearson and others, ref. 3.) One table,
from the memoir cited, was given as an example in the last chapter
(Table IV.).

Fertility in man (i.e. the number of children born to a given pair)
is very largely influenced by the age of husband and wife at
marriage (especially the latter), and by the duration of marriage.
It is desired to find whether it is also influenced by the heritable
constitution of the parents, i.e. whether, allowance being made for
the effect of such disturbing causes as age and duration of marriage,
fertility is itself a heritable character.

The effect of duration of marriage may be largely eliminated
by excluding all marriages which have not lasted, say, 16 years
at least. This will rather heavily reduce the number of records
available, but will leave a sufficient number for discussion. It
would be desirable to eliminate the effect of late marriages in
the same way by excluding all cases in which, say, husband was
over 30 years of age or wife over 26 (or even less) at the time
of marriage. But, unfortunately, this is impossible; the age of
the wife, the most important factor, is only exceptionally given
in peerages, family histories, and similar works, from which the
data must be compiled. All marriages must therefore be
included, whatever the age of the parents at marriage, and the
effect of the varying age at marriage must be estimated
afterwards.

10. But the correlation between (1) number of children of a
woman and (2) number of children of her daughter will be further
affected according as we include in the record all her available
daughters or only one. Suppose, e.g., the number of children in
the first generation is 5 (say the mother and her brothers and
sisters), and that she has three daughters with 0, 2, and 4
children respectively: are we to enter all three pairs (5, 0),
(5, 2), (5, 4) in the correlation-table, or only one pair? If the
latter, which pair? For theoretical simplicity the second process
is distinctly the best (though it still further limits the available
data). If it be adopted, some regular rule will have to be made
for the selection of the daughter whose fertility shall be entered
in the table, so as to avoid bias: the first daughter married
for whom data are given, and who fulfils the conditions as to
duration of marriage, may, for instance, be taken in every case.
(For a much more detailed discussion of the problem, and the
allied problems regarding the inheritance of fertility in the horse,
the student is referred to the original.)
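
The two ways of entering a family in the correlation-table can be set side by side. The following is a sketch only: the record layout, the function names, the durations of marriage and the minimum duration passed in the last lines are all hypothetical; only the numbers of children (5 in the first generation; 0, 2 and 4 for the daughters) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Daughter:
    marriage_order: int   # 1 = first daughter to marry
    children: int         # number of children of the daughter
    years_married: int    # duration of her marriage in years


def all_pairs(first_generation_size, daughters):
    """First process: one pair for every available daughter."""
    return [(first_generation_size, d.children) for d in daughters]


def one_pair(first_generation_size, daughters, min_years):
    """Second process: only the first daughter married who satisfies
    the condition as to duration of marriage."""
    eligible = [d for d in daughters if d.years_married >= min_years]
    if not eligible:
        return None
    chosen = min(eligible, key=lambda d: d.marriage_order)
    return (first_generation_size, chosen.children)


# Hypothetical durations of marriage: the first daughter fails the condition.
daughters = [Daughter(1, 0, 4), Daughter(2, 2, 21), Daughter(3, 4, 18)]

print(all_pairs(5, daughters))
print(one_pair(5, daughters, min_years=16))
```

A fixed rule of this kind leaves the choice of the pair independent of the fertility actually observed, which is the sense in which bias is avoided.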

11. Illustration iii. The subject for investigation is the
relation between the bulk of a crop (wheat and other cereals,
turnips and other root crops, hay, etc.), and the weather. (Cf.
Hooker, ref. 6.)

Produce-statistics for the more important crops of Great
Britain have been issued by the Board of Agriculture since
1886: the figures are based on estimates of the yield furnished
by official local estimators all over the country. Estimates are
published for separate counties and for groups of counties
(divisions). But the climatic conditions vary so much over the
United Kingdom that it is better to deal with a smaller area,
more homogeneous from the meteorological standpoint. On the
other hand, the area should not be too small; it should be large
enough to present a representative variety of soil. The group
of eastern counties, consisting of Lincoln, Hunts, Cambridge,
Norfolk, Suffolk, Essex, Bedford, and Hertford, was selected as
fulfilling these conditions. The group includes the county with
the largest acreage of each of the ten crops investigated, with
the single exception of permanent grass.

12. The produce of a crop is dependent on the weather of
a long preceding period, and it is naturally desired to find the
influence of the weather at all successive stages during this
period, and to determine, for each crop, which period of the
year is of most critical importance as regards weather. It must
be remembered, however, that the times of both sowing and
harvest are themselves very largely dependent on the weather,
and consequently, on an average of many years, the limits of
the critical period will not be very well defined. If, therefore,
we correlate the produce of the crop (X) with the characteristics
of the weather (Y) during successive intervals of the year, it
will be as well not to make these intervals too short. It was
accordingly decided to take successive groups of 8 weeks, over-
lapping each other by 4 weeks, i.e. weeks 1-8, 5-12, etc.
Correlation coefficients were thus obtained at 4-weeks intervals,
but based on 8 weeks' weather.
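
The overlapping groups of weeks can be listed mechanically. In this minimal sketch the function name and the choice of a single year of 52 weeks are assumptions; the length of 8 weeks and the step of 4 weeks are those of the text.

```python
def overlapping_windows(n_weeks=52, length=8, step=4):
    """Return (first week, last week) for successive overlapping groups of weeks."""
    return [
        (start, start + length - 1)
        for start in range(1, n_weeks - length + 2, step)
    ]


windows = overlapping_windows()
print(windows[:3], "...", windows[-1])
```

Each week after the fourth falls into two groups, so that neighbouring coefficients are not independent of each other.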

13. It remains to be decided what characteristics of the weather
are to be taken into account. The rainfall is clearly one factor
of great importance, temperature is another, and these two will
afford quite enough labour for a first investigation. The weekly
rainfalls were averaged for eight stations within the area, and
the average taken as the first characteristic of the weather.
Temperatures were taken from the records of the same stations.
The average temperatures, however, do not give quite the sort
of information that is required; at temperatures below a certain
limit (about 42° Fahr.) there is very little growth, and the
growth increases in rapidity as the temperature rises above this
point (within limits). It was therefore decided to utilise the
figures for "accumulated temperatures above 42° Fahr.," i.e.
the total number of day-degrees above 42° during each of the
8-weekly periods, as the second characteristic of the weather;
these "accumulated temperatures," moreover, show much larger
variations than mean temperatures.
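
The notion of day-degrees may be illustrated by a simplified calculation. The sketch below assumes that one mean temperature is available for each day and counts only the excess of that mean above the base; the official figures may have been derived otherwise, and the function name and the sample temperatures are hypothetical.

```python
def accumulated_temperature(daily_means_f, base_f=42.0):
    """Day-degrees above the base: days below the base contribute nothing."""
    return sum(max(0.0, t - base_f) for t in daily_means_f)


# Four hypothetical daily mean temperatures in degrees Fahrenheit.
sample = [40.0, 42.0, 45.0, 50.5]
print(accumulated_temperature(sample))
```

Two periods with the same mean temperature may thus differ widely in accumulated temperature, according to how many days lie above the limit and by how much.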

The student should refer to the original for the full dis-
cussion as to data. The method of treating the correlations
between three variables, based on the three possible correlations
between them, is described in Chapter XII.

14. Problems of a somewhat special kind arise when dealing
with the relations between simultaneous values of two variables
which have been observed during a considerable period of time,
for the more rapid movements will often exhibit a fairly close
consilience, while the slower changes show no similarity. The two
following examples will serve as illustrations of two methods which
are generally applicable to such cases.

Illustration iv. Fig. 41 exhibits the movements of (1) the
infantile mortality (deaths of infants under 1 year of age per 1000
births in the same year); (2) the general mortality (deaths at all
ages per 1000 living) in England and Wales during the period
1838-1904. A very cursory inspection of the figure shows that
when the infantile mortality rose from one year to the next
the general mortality also rose, as a rule; and similarly, when the
infantile mortality fell, the general mortality also fell. There
were, in fact, only five or six exceptions to this rule during the
whole period under review. The correlation between the annual
values of the two mortalities would nevertheless not be very high,
as the general mortality has been falling more or less steadily since
1876 or thereabouts, while the infantile mortality attained almost
a record value in 1899. During a long period of time the correla-
tion between annual values may, indeed, very well vanish, for the
two mortalities are affected by causes which are to a large extent
different in the two cases. To exhibit, therefore, the closeness of
the relation between infantile and general mortality, for such
causes as show marked changes between one year and the next, it
will be best to proceed by correlating the annual changes, and not
the annual values. The work would be arranged in the following
form (only sufficient years being given to exhibit the principle of
the process), and the correlation worked out between the figures of
columns 3 and 5.



   1.         2.            3.            4.            5.

           Infantile    Increase or     General     Increase or
  Year.    Mortality     Decrease      Mortality     Decrease
           per 1000      from Year     per 1000      from Year
            Births.       before.       living.       before.

  1838       159                         22.4
  1839       151           -8            21.8          -0.6
  1840       154           +3            22.9          +1.1
  1841       145           -9            21.6          -1.3
  1842       152           +7            21.7          +0.1
  1843       150           -2            21.2          -0.5



For the period to which the diagram refers, viz. 1838-1904, the
following constants were found by this method:

    Infantile mortality, mean annual change   -0.31
                         standard deviation    9.63

    General mortality,   mean annual change   -0.09
                         standard deviation    1.14

    Coefficient of correlation                +0.77

This is a much higher correlation than would arise from the
mere fact that the deaths of infants form part of the general
mortality, and consequently there must be a high correlation
between the annual changes in the mortality of those who are over
and under 1 year of age. (Cf. Exercises 7 and 8, Chap. XI., and
for method ref. 5.)
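
The method of correlating annual changes can be sketched on a few years of the specimen table. The figures used below are those for 1839-1843 as read from that table; the function names are assumptions introduced for illustration. With only four changes the coefficient obtained has no significance whatever and merely shows the arrangement of the work; the value +0.77 quoted above rests on the whole period.

```python
import math

# Annual values for 1839-1843, as read from the specimen table.
infantile = [151, 154, 145, 152, 150]          # deaths under 1 year per 1000 births
general = [21.8, 22.9, 21.6, 21.7, 21.2]       # deaths at all ages per 1000 living


def first_differences(series):
    """Increase or decrease of each year on the year before."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def pearson_r(xs, ys):
    """Product-moment correlation of two equally long lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


d_infantile = first_differences(infantile)     # column 3 of the table
d_general = first_differences(general)         # column 5 of the table
r_changes = pearson_r(d_infantile, d_general)

print(d_infantile, [round(d, 1) for d in d_general], round(r_changes, 2))
```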



Fig. 41. Infantile and General Mortality in England and Wales, 1838-1904.

15. Illustration v. The two curves of fig. 42 show (1) the
marriage-rate (persons married per 1000 of the population) for
England and Wales; (2) the values of exports and imports per
head of the population of the United Kingdom for every year
from 1855 to 1904. Inspection of the diagram suggests a similar
relation to that of the last example, the one variable showing a
rise from one year to the next when the other rises, and a fall
when the other falls. The movement of both variables is, how-
ever, of a much more regular kind than that of mortality,
resembling a series of "waves" superposed on a steady general
trend, and it is the "waves" in the two variables (the short-period
movements, not the slower trends) which are so clearly related.

Fig. 42. Marriage-rate and Foreign Trade, England and Wales, 1855-1904.

16. It is not difficult, moreover, to separate the short-period
oscillations, more or less approximately, from the slower movement.
Suppose the marriage-rate for each year replaced by the average
of an odd number of years of which it is the centre, the number
being as near as may be the same as the period of the "waves,"
e.g. nine years. If these short-period averages were plotted on
the diagram instead of the rates of the individual years, we should
evidently obtain a smoother curve which would clearly exhibit
the trend and be practically free from the conspicuous waves.
The excess or defect of each annual rate above or below the
trend, if plotted separately, would therefore give the "waves"
apart from the slower changes. The figures for foreign trade
may be treated in the same way as the marriage-rate, and we
can accordingly work out the correlation between the waves or
rapid fluctuations, undisturbed by the movements of longer period,
however great they may be. The arithmetic may be carried out
in the form of the following table, and the correlation worked out
in the ordinary way between the figures of columns 4 and 7.



  1.        2.            3.          4.          5.             6.          7.
 Year.   Marriage-    Nine Years'   Differ-    Exports +     Nine Years'   Differ-
         rate          Average.     ence.      Imports,       Average.     ence.
         (England                              £ per head
         and Wales).                           (U.K.).

 1855     16.2           -            -          9.36            -            -
 1856     16.7           -            -         11.14            -            -
 1857     16.5           -            -         11.85            -            -
 1858     16.0           -            -         10.73            -            -
 1859     17.0          16.5         +0.5       11.72          12.15        -0.43
 1860     17.1          16.6         +0.5       13.03          12.94        +0.09
 1861     16.3          16.7         -0.4       13.01          13.52        -0.51
 1862     16.1          16.8         -0.7       13.40          14.17        -0.77
 1863     16.8          16.9         -0.1       15.13          14.81        +0.32
 1864     17.2           -            -         16.43            -            -
 1865     17.5           -            -         16.37            -            -
 1866     17.5           -            -         17.72            -            -
 1867     16.5           -            -         16.47            -            -
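The arithmetic of the table can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the thirteen yearly figures are the values of columns 2 and 5 as best they can be read from the scanned table, so individual digits should be treated as assumptions rather than as verified data.

```python
from math import sqrt

def centred_mean(values, span=9):
    """Centred moving average; None where a full window is not available."""
    half = span // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / span
    return out

def deviations(values, span=9):
    """Excess or defect of each value over its centred moving average."""
    trend = centred_mean(values, span)
    return [None if t is None else v - t for v, t in zip(values, trend)]

def correlation(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Columns 2 and 5 for 1855-1867 (digits as read from the scan: an assumption).
years = list(range(1855, 1868))
marriage_rate = [16.2, 16.7, 16.5, 16.0, 17.0, 17.1, 16.3,
                 16.1, 16.8, 17.2, 17.5, 17.5, 16.5]
trade_per_head = [9.36, 11.14, 11.85, 10.73, 11.72, 13.03, 13.01,
                  13.40, 15.13, 16.43, 16.37, 17.72, 16.47]

dev_marriage = deviations(marriage_rate)   # column 4
dev_trade = deviations(trade_per_head)     # column 7

# Keep only the years for which a nine-year average exists (1859-1863 here).
usable = [i for i, d in enumerate(dev_marriage) if d is not None]
r_waves = correlation([dev_marriage[i] for i in usable],
                      [dev_trade[i] for i in usable])

for i in usable:
    print(years[i], round(dev_marriage[i], 1), round(dev_trade[i], 2))
```

With only five usable years the coefficient from this fragment is of no statistical weight; it merely shows the mechanics applied to the full series in the text.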



17. Fig. 43 is drawn from the figures of columns 4 and 7, and
shows very well how closely the oscillations of the marriage-rate
are related to those of trade. For the period 1861-95 the
correlation between the two oscillations (Hooker, ref. 4) is 0.86.
The method may obviously be extended by correlating the
deviation of the marriage-rate in any one year with the deviation of
the exports and imports of the year before, or two years before,
instead of the same year; if a sufficient number of years be
taken, an estimate may be made, by interpolation, of the
time-difference that would make the correlation a maximum if it were
possible to obtain the figures for exports and imports for periods
other than calendar years. Thus Mr Hooker finds (ref. 4) that
on an average of the years 1861-95 the correlation would be a
maximum between the marriage-rate and the foreign trade of
about one-third of a year earlier. The method is an extremely
useful one and is obviously applicable to any similar case. The
student should refer to the paper by Mr Hooker, cited. Reference
may also be made to ref. 9, in which several diagrams are given
similar to fig. 43, and the nature of the relationship between the
marriage-rate and such factors as trade, unemployment, etc., is
discussed, it being suggested that the relation is even more
complex than appears from the above.

Fig. 43. Fluctuations in (1) Marriage-rate and (2) Foreign Trade (Exports
+ Imports per head) in England and Wales: the Curves show Deviations
from 9-year means. Data of R. H. Hooker, Jour. Roy. Stat. Soc., 1901.
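The extension to time-differences may be sketched as follows. The two short series are invented purely for illustration (the second is the first moved one year later), and the helper names are hypothetical; nothing here reproduces Mr Hooker's figures or his method of interpolation.

```python
from math import sqrt

def correlation(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def lagged_correlation(a, b, lag):
    """Correlate a[t] with b[t - lag], i.e. b taken `lag` years earlier."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return correlation(a[lag:], b[:len(b) - lag])

# Invented deviations from trend (NOT data from the text).
trade_dev = [0.5, 1.0, 0.2, -0.8, -1.1, -0.3, 0.6, 1.2, 0.4, -0.7]
# Marriage deviations constructed to follow trade by exactly one year.
marriage_dev = [0.1] + trade_dev[:-1]

r_same_year = lagged_correlation(marriage_dev, trade_dev, 0)
r_year_before = lagged_correlation(marriage_dev, trade_dev, 1)
r_two_before = lagged_correlation(marriage_dev, trade_dev, 2)
print(r_same_year, r_year_before, r_two_before)
```

In this contrived case the coefficient is greatest for the trade of the year before, as it must be by construction; with real figures the coefficients at successive lags would be compared and the position of the maximum estimated between them.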

18. It was briefly mentioned in § 9 of the last chapter that
the treatment of cases when the regression was non-linear was,
in general, somewhat difficult. Such cases lie strictly outside
the scope of the present volume, but it may be pointed out
that if a relation between X and Y be suggested, either by
theory or by previous experience, it may be possible to throw
that relation into the form

$$Y = A + B\,\phi(X),$$

where A and B are the only unknown constants to be determined.
If a correlation-table be then drawn up between Y and $\phi(X)$
instead of Y and X, the regression will be approximately linear.
Thus in Table V. of the last chapter, if X be the rate of
discount and Y the percentage of reserves on deposits, a
diagram of the curves of regression, or curves on which the
means of arrays lie, suggests that the relation between X and Y
is approximately of the form

$$X(Y - B) = A,$$

A and B being constants; that is,

$$XY = A + BX.$$

Or, if we make XY a new variable, say Z,

$$Z = A + BX.$$

Hence, if we draw up a new correlation-table between X and Z
the regression will probably be much more closely linear.
If the relation between the variables be of the form

$$Y = AB^{X},$$

we have

$$\log Y = \log A + X \log B,$$

and hence the relation between log Y and X is linear. Similarly,
if the relation be of the form

$$X^{n}Y = A,$$

we have

$$\log Y = \log A - n \log X,$$

and so the relation between log Y and log X is linear. By
means of such artifices for obtaining correlation-tables in
which the regression is linear, it may be possible to do a good
deal in difficult cases whilst using elementary methods only.
The advanced student should refer to ref. 12 for a different
method of treatment.
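A small sketch may make these artifices concrete. The data below are manufactured exactly from the assumed laws (they are not taken from any table in the text), and `fit_line` is a hypothetical helper performing an ordinary least-squares fit.

```python
from math import exp, log

def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Case 1: Y = A * B**X, so log Y is linear in X (assumed A = 2, B = 1.5).
xs = [0, 1, 2, 3, 4]
ys = [2 * 1.5 ** x for x in xs]
log_a, log_b = fit_line(xs, [log(y) for y in ys])
A_exp, B_exp = exp(log_a), exp(log_b)

# Case 2: X(Y - B) = A, so Z = XY is linear in X (assumed A = 6, B = 3).
xs2 = [1, 2, 4, 5]
ys2 = [3 + 6 / x for x in xs2]
A_rec, B_rec = fit_line(xs2, [x * y for x, y in zip(xs2, ys2)])

print(A_exp, B_exp, A_rec, B_rec)
```

Because the artificial data obey the assumed laws exactly, the fitted constants return the values put in; with observed data the same transformations would merely bring the regression nearer to linearity.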

19. The only strict method of calculating the correlation
coefficient is that described in Chapter IX. from the formula
$r = \Sigma(xy)/(N\sigma_x\sigma_y)$. Approximations to this value may,
however, be found in various ways, for the most part dependent
either (1) on the formulae for the two regressions
$r\,\sigma_x/\sigma_y$ and $r\,\sigma_y/\sigma_x$, or (2) on
the formulae for the standard deviations of the arrays
$\sigma_x\sqrt{1-r^2}$ and $\sigma_y\sqrt{1-r^2}$. Such approximate
methods are not recommended for ordinary use, as they will lead to
different results in different hands, but a few may be given here,
as being occasionally useful for estimating the value of the
correlation in cases where the data are not given in such a shape
as to permit of the proper calculation of the coefficient.

(1) The means of rows and columns are plotted on a diagram,
and lines fitted to the points by eye, say by shifting about
a stretched black thread until it seems to run as near as may
be to all the points. If $b_1$, $b_2$ be the slopes of these two lines
to the vertical and the horizontal respectively,

$$r = \sqrt{b_1 b_2}.$$

Hence the value of r may be estimated from any such diagram
as figs. 36-40 in Chapter IX., in the absence of the original
table. Further, if a correlation-table be not grouped by
equal intervals, it may be difficult to calculate the product
sum, but it may still be possible to plot approximately a diagram
of the two lines of regression, and so determine roughly the
value of r. Similarly, if only the means of two rows and
two columns, or of one row and one column in addition to the
means of the two variables, are known, it will still be possible
to estimate the slopes of RR and CC, and hence the correlation
coefficient.
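The relation between the two slopes and r can be checked numerically. In the sketch below the two lines are fitted by least squares to five invented pairs of values rather than by eye, so it illustrates the identity and not the graphical procedure itself; all names are hypothetical.

```python
from math import copysign, sqrt

# Invented pairs of values, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

b1 = sxy / syy   # regression of x on y: slope of that line to the vertical
b2 = sxy / sxx   # regression of y on x: slope of that line to the horizontal

r_from_slopes = copysign(sqrt(b1 * b2), b1)   # r takes the sign of the slopes
r_product_sum = sxy / sqrt(sxx * syy)
print(r_from_slopes, r_product_sum)
```

The two values agree exactly here; when the lines are drawn by eye the agreement will of course be only as good as the drawing.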

(2) The means of one set of arrays only, say the rows, are
calculated, and also the two standard-deviations $\sigma_x$ and $\sigma_y$. The
means are then plotted on a diagram, using the standard-deviation
of each variable as the unit of measurement, and a line fitted by
eye. The slope of this line to the vertical is r. If the standard
deviations be not used as the units of measurement in plotting,
the slope of the line to the vertical is $r\,\sigma_x/\sigma_y$, and hence r will be
obtained by dividing the slope by the ratio of the
standard-deviations.

This method, or some variation of it, is often useful as a
makeshift when the data are too incomplete to permit of the
proper calculation of the correlation, only one line of regression
and the ratio of the dispersions of the two variables being required:
the ratio of the quartile deviations, or other simple measures of
dispersion, will serve quite well for rough purposes in lieu of the
ratio of standard-deviations. As a special case, we may note that
if the two dispersions are approximately the same, the slope of
RR to the vertical is r.

Plotting the medians of arrays on a diagram with the quartile
deviations as units, and measuring the slope of the line, was the
method of determining the correlation coefficient ("Galton's
function") used by Sir Francis Galton, to whom the introduction
of such a coefficient is due. (Refs. 2-4 of Chap. IX. p. 188.)

(3) If $s_x$ be the standard-deviation of errors of estimate like
$x - b_1 y$, we have from Chap. IX. § 11

$$s_x^2 = \sigma_x^2(1 - r^2),$$

and hence

$$r = \sqrt{1 - \frac{s_x^2}{\sigma_x^2}}.$$

But if the dispersions of arrays do not differ largely, and the
regression is nearly linear, the value of $s_x$ may be estimated from
the average of the standard-deviations of a few rows, and r
determined, or rather estimated, accordingly. Thus in Table III.,
Chap. IX., the standard-deviations of the ten columns headed
62.5-63.5, 63.5-64.5, etc., are

    2.56    2.26
    2.11    2.26
    3.56    2.45
    2.24    2.33

The standard-deviation of the stature o

approximately

$$r = \sqrt{1 - \frac{s_x^2}{\sigma_x^2}} = 0.514.$$

This is the same as the value found by the product-sum method
to the second decimal place. It would be better to take an
average by weighting the square of each standard-deviation with
the number of observations in the column, but in the present
case this would only lead to a very slightly different result, viz.
$s_x = 2.362$, r = 0.512.
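The arithmetic of method (3) may be sketched thus. The array standard-deviation 2.362 is the weighted value quoted in the text; the total standard-deviation 2.75 does not survive legibly in this copy and is an assumption, chosen because it reproduces the quoted r. The function names are hypothetical.

```python
from math import sqrt

def pooled_sd(array_sds, counts):
    """Average of array standard-deviations, weighting each square by its count."""
    total = sum(counts)
    return sqrt(sum(n * s * s for s, n in zip(array_sds, counts)) / total)

def r_from_array_sd(array_sd, total_sd):
    """Estimate r from s^2 = sigma^2 (1 - r^2); the sign of r is not recovered."""
    return sqrt(1 - (array_sd / total_sd) ** 2)

# 2.362 is quoted in the text; 2.75 is an ASSUMED total standard-deviation.
r_weighted = r_from_array_sd(2.362, 2.75)
print(round(r_weighted, 3))

# Two equally weighted arrays with standard-deviations 3 and 4 (invented figures).
print(pooled_sd([3, 4], [1, 1]))
```

The estimate is only as trustworthy as the assumption that the arrays are about equally dispersed round a straight line of regression.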

The method is clearly inapplicable to such tables as V. and VI.
of Chap. IX., in which the means of successive arrays do not lie
closely round straight lines. In such cases it would always tend
to give a value for r markedly higher than that given by the
product-sum method. That method gives a value dependent on the
standard-deviation round the line of regression; the method used
here gives a value dependent on the standard-deviation round a
curve which sweeps through all the means of arrays, and the second
standard-deviation is necessarily less than the first. The method
thus leads to a generalised correlation-coefficient (Pearson's
correlation-ratio) measuring the approach towards a curvilinear
line of regression of any form. (Ref. 12.)

REFERENCES. 

Illustrative Applications, principally to Economic Statistics,
and Practical Methods.



p. 813. 

(2) Yule, G. U., "An Investigation into the Causes of Changes in Pauperism
in England chiefly during the last two Intercensal Decades," Jour.
Roy. Stat. Soc., vol. lxii., 1899, p. 249. (Cf. Illustration i.)

(3) Pearson, Karl, Alice Lee, and L. Bramley Moore, "Genetic
(reproductive) Selection: Inheritance of Fertility in Man and of
Fecundity in thoroughbred Race-horses," Phil. Trans. Roy. Soc., Series
A, vol. cxcii., 1899, p. 257. (Cf. Illustration ii.)

(4) Hooker, R. H., "On the Correlation of the Marriage-rate with Trade,"
Jour. Roy. Stat. Soc., vol. lxiv., 1901, p. 485. (The method of
Illustration v.)

(5) Hooker, R. H., "On the Correlation of Successive Observations:
illustrated by Corn-prices," ibid., vol. lxviii., 1905, p. 696. (The method
of Illustration iv.)

(6) Hooker, R. H., "The Correlation of the Weather and the Crops," ibid.,
vol. lxx., 1907, p. 1. (Cf. Illustration iii.)

(7) Norton, J. P., Statistical Studies in the New York Money Market;
Macmillan Co., New York, 1902. (Applications to financial statistics:
an instantaneous average method, analogous to that of illustration v., is
employed, but the instantaneous average is obtained by an interpolated
logarithmic curve.)

(8) March, L., "Comparaison numérique de courbes statistiques," Jour.
de la société de statistique de Paris, 1905, pp. 255 and 306. (Uses the
methods of illustrations iv. and v., but obtaining the instantaneous
average in the latter case by graphical interpolation.)

(9) Yule, G. U., "On the Changes in the Marriage and Birth Rates in
England and Wales during the past Half Century, with an Inquiry as
to their probable Causes," Jour. Roy. Stat. Soc., vol. lxix., 1906, p. 88.

(10) Heron, D., On the Relation of Fertility in Man to Social Status,
"Drapers' Co. Research Memoirs: Studies in National Deterioration,"
I.; Dulau & Co., London, 1906.

Theory of Correlation in the case of Non-linear Regression.

(11) Pearson, Karl, "On the Systematic Fitting of Curves to Observations
and Measurements," Biometrika, vol. i. p. 265, and vol. ii. p. 1, 1902.
(The second part is useful for the fitting of curves in cases of non-linear
regression.)

(12) Pearson, Karl, On the General Theory of Skew Correlation and
Non-linear Regression, "Drapers' Co. Research Memoirs: Biometric Series,"
II.; Dulau & Co., London, 1905. (Suggests a "correlation ratio"
which measures the approach of the points given by every pair of values
of the two variables x and y to a curve of any form, in the same way
that the correlation-coefficient measures the closeness to a straight line,
by utilising the standard-deviation of arrays.)

(13) Pearson, Karl, "On a General Theory of the Method of False
Position," Phil. Mag., June 1903. (A method of curve fitting by
the use of trial solutions.)

(14) Blakeman, J., "On Tests for Linearity of Regression in Frequency-
distributions," Biometrika, vol. iv., 1905, p. 332.

Abbreviated Methods of Calculation.

See also references to Chapter XVI.

(15) Harris, J. Arthur, "A Short Method of Calculating the Coefficient
of Correlation in the case of Integral Variates," Biometrika, vol. vii.,
1909, p. 214. (Not an approximation, but a true short method.)






CHAPTER XI. 

MISCELLANEOUS THEOREMS INVOLVING THE USE OF
THE CORRELATION-COEFFICIENT.

1. Introductory; 2. Standard-deviation of a sum or difference; 3. Influence
of grouping of observations on the standard-deviation; 4-5. Influence
of errors of observation on the standard-deviation; 6-7. Influence of
errors of observation on the correlation-coefficient (Spearman's
theorems); 8. Mean and standard-deviation of an index; 9. Correlation
between indices; 10. Correlation-coefficient for a two- x two-fold
table; 11. Correlation coefficient for all possible pairs of N values of a
variable; 12. Correlation due to heterogeneity of material; 13. Reduction
of correlation due to mingling of uncorrelated with correlated
material; 14-17. The weighted mean; 18-19. Application of weighting
to the correction of death-rates, etc., for varying sex and
age-distributions; 20. The weighting of forms of average other than the
arithmetic mean.

1. It has already been pointed out that a statistical measure, if
it is to be widely useful, should lend itself readily to algebraical
treatment. The arithmetic mean and the standard-deviation
derive their importance largely from the fact that they fulfil this
requirement better than any other averages or measures of
dispersion; and the following illustrations, while giving a number of
results that are of value in one branch or another of statistical
work, suffice to show that the correlation-coefficient can be treated
with the same facility. This might indeed be expected, seeing
that the coefficient is derived, like the mean and standard-deviation,
by a straightforward process of summation.

2. To find the Standard-deviation of the sum or difference Z of
corresponding values of two variables $X_1$ and $X_2$.

Let z, $x_1$, $x_2$ denote deviations of the several variables from
their arithmetic means. Then if

$$Z = X_1 \pm X_2,$$

evidently

$$z = x_1 \pm x_2.$$

Squaring both sides of the equation and summing,

$$\Sigma(z^2) = \Sigma(x_1^2) + \Sigma(x_2^2) \pm 2\Sigma(x_1x_2).$$

That is, if r be the correlation between $x_1$ and $x_2$, and $\sigma$, $\sigma_1$, $\sigma_2$
the respective standard-deviations,

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 \pm 2r\,\sigma_1\sigma_2 \qquad (1)$$

If $x_1$ and $x_2$ are uncorrelated, we have the important special case

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 \qquad (2)$$

The student should notice that in this case the standard-
deviation of the sum of corresponding values of the two variables
is the same as the standard-deviation of their difference.

The same process will evidently give the standard-deviation of a
linear function of any number of variables. For the sum of a
series of variables $X_1$, $X_2$, .... $X_n$ we must have

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_n^2 + 2r_{12}\sigma_1\sigma_2 + 2r_{13}\sigma_1\sigma_3 + \dots + 2r_{23}\sigma_2\sigma_3 + \dots,$$

$r_{12}$ being the correlation between $X_1$ and $X_2$, $r_{23}$ the correlation
between $X_2$ and $X_3$, and so on.
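Equation (1) is easily verified on a small invented example. The figures below are made up for the purpose; standard-deviations are taken about the mean with divisor N, as elsewhere in the text, and the helper names are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Mean square deviation from the arithmetic mean (divisor N)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return sxy / sqrt(variance(xs) * variance(ys))

# Invented corresponding values of two variables.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

r = correlation(x1, x2)
s1, s2 = sqrt(variance(x1)), sqrt(variance(x2))

var_sum_direct = variance([a + b for a, b in zip(x1, x2)])
var_diff_direct = variance([a - b for a, b in zip(x1, x2)])
var_sum_formula = s1 ** 2 + s2 ** 2 + 2 * r * s1 * s2
var_diff_formula = s1 ** 2 + s2 ** 2 - 2 * r * s1 * s2
print(var_sum_direct, var_sum_formula, var_diff_direct, var_diff_formula)
```

Here r = 0.8, so the sum is much more variable than the difference; with r = 0 the two would coincide, as noted under equation (2).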

3. Influence of Grouping on the Standard-deviation. The results
of § 2 may be applied to give an approximate correction for the
effects of grouping observations on the standard-deviation.
Instead of assigning to any observation its true value X, we assign
to it the value Z corresponding to the centre of the class-interval,
thereby making an error $\delta$, where

$$Z = X + \delta.$$

Now regarding the frequency-distribution, to a first approximation,
as built up of a series of rectangles, like the histogram, the
frequency being uniformly distributed over each interval, the
correlation between X and $\delta$ is zero, for the mean value of $\delta$ is
zero for every interval. Further, to the same degree of
approximation, the standard deviation of $\delta$ is $\sqrt{c^2/12}$, where c is the class-
interval (Chap. VIII. § 12, eqn. (10)). Hence if $\sigma$ be the
standard-deviation of the grouped values Z, and $\sigma_1$ the standard-
deviation of the true values X, we have approximately

$$\sigma_1^2 = \sigma^2 - \frac{c^2}{12} \qquad (3)$$

This is a formula of correction (Sheppard's correction, refs. 1 to 4)
that is very frequently used.
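As a sketch of how the correction is applied, the following takes an invented grouped distribution (class-interval 2) and subtracts $c^2/12$ from the square of the grouped standard-deviation. The frequencies and the function names are assumptions made for illustration.

```python
from math import sqrt

def grouped_variance(midpoints, frequencies):
    """Square of the standard-deviation computed from class mid-values."""
    n = sum(frequencies)
    m = sum(f * x for x, f in zip(midpoints, frequencies)) / n
    return sum(f * (x - m) ** 2 for x, f in zip(midpoints, frequencies)) / n

def sheppard_corrected_variance(grouped_var, class_interval):
    """Equation (3): subtract c^2 / 12 from the grouped value."""
    return grouped_var - class_interval ** 2 / 12

# Invented frequency-distribution with class-interval c = 2.
midpoints = [1, 3, 5, 7, 9]
frequencies = [1, 4, 6, 4, 1]

raw = grouped_variance(midpoints, frequencies)
corrected = sheppard_corrected_variance(raw, class_interval=2)
print(raw, corrected, sqrt(corrected))
```

The correction is small unless the class-interval is a considerable fraction of the standard-deviation.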



4. Influence of Errors of Observation on the Standard-deviation.
The results may be further applied to the theory of errors of
observation. Let us suppose that, if any value of X be observed
a large number of times, the arithmetic mean of the observations
is approximately the true value, the arithmetic mean error being
zero. Then, the arithmetic mean error being zero for all values
of X, the error, say $\delta$, is uncorrelated with X. In this case if $x_1$ be
an observed deviation from the arithmetic mean, x the true
deviation, we have from the preceding

$$\sigma_{x_1}^2 = \sigma_x^2 + \sigma_\delta^2 \qquad (4)$$

The effect of errors of observation is, consequently, to increase the
standard-deviation above its true value. The student should
notice that the assumption made does not imply the complete
independence of X and $\delta$: he is quite at liberty to suppose that
errors fluctuate more, for example, with large than with small
values of X, as might very probably happen. In that case the
contingency-coefficient between X and $\delta$ would not be zero,
although the correlation-coefficient might still vanish as
supposed.

5. If the observations be repeated so that we have in every case
two measures $x_1$ and $x_2$ of the same deviation x, it is possible to
obtain the true standard-deviation $\sigma_x$ if the further assumption
is legitimate that the errors $\delta_1$ and $\delta_2$ are uncorrelated with each
other. On this assumption

$$\Sigma(x_1x_2) = \Sigma(x + \delta_1)(x + \delta_2) = \Sigma(x^2),$$

and accordingly

$$\sigma_x^2 = \frac{\Sigma(x_1x_2)}{N} \qquad (5)$$

(This formula is part of Spearman's formula for the correction of
the correlation-coefficient, cf. § 7.)

6. Influence of Errors of Observation on the Correlation-coefficient.
Let $x_1$, $y_1$ be the observed deviations from the arithmetic means,
x, y the true deviations, and $\delta$, $\epsilon$ the errors of observation. Of
the four quantities x, y, $\delta$, $\epsilon$ we will suppose x and y alone to
be correlated. On this assumption

$$\Sigma(x_1y_1) = \Sigma(xy) \qquad (6)$$

It follows at once that

$$\frac{r_{x_1y_1}}{r_{xy}} = \frac{\sigma_x\sigma_y}{\sigma_{x_1}\sigma_{y_1}},$$

and consequently the observed correlation is less than the true
correlation. This difference, it should be noticed, no mere increase
in the number of observations can in any way lessen.

7. Spearman's Theorems. If, however, the observations of both
x and y be repeated, as assumed in § 5, so that we have two
measures $x_1$ and $x_2$, $y_1$ and $y_2$ of every value of x and y, the true
value of the correlation can be obtained by the use of equations
(5) and (6), on assumptions similar to those made above. For
we have

$$r_{xy}^2 = \frac{\Sigma(x_1y_1)\,\Sigma(x_2y_2)}{\Sigma(x_1x_2)\,\Sigma(y_1y_2)} = \frac{\Sigma(x_1y_2)\,\Sigma(x_2y_1)}{\Sigma(x_1x_2)\,\Sigma(y_1y_2)}$$

$$\phantom{r_{xy}^2} = \frac{r_{x_1y_1}\,r_{x_2y_2}}{r_{x_1x_2}\,r_{y_1y_2}} = \frac{r_{x_1y_2}\,r_{x_2y_1}}{r_{x_1x_2}\,r_{y_1y_2}} \qquad (7)$$

Or, if we use all the four possible correlations between observed
values of x and observed values of y,

$$r_{xy}^4 = \frac{r_{x_1y_1}\,r_{x_2y_2}\,r_{x_1y_2}\,r_{x_2y_1}}{(r_{x_1x_2}\,r_{y_1y_2})^2} \qquad (8)$$

Equation (8) is the original form in which Spearman gave his
correction formula (refs. 5, 6). It will be seen to imply the
assumption that, of the six quantities x, y, $\delta_1$, $\delta_2$, $\epsilon_1$, $\epsilon_2$, x and y
alone are correlated. The correction given by the second part
of equation (7), also suggested by Spearman, seems, on the
whole, to be safer, for it eliminates the assumption that the errors
in x and in y, in the same series of observations, are uncorrelated.
An insufficient though partial test of the correctness of the
assumptions may be made by correlating $x_1 - x_2$ with $y_1 - y_2$: this
correlation should vanish. Evidently, however, it may vanish
from symmetry without thereby implying that all the correlations
of the errors are zero.
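The working of these corrections can be illustrated on an artificial set of observations in which the assumptions hold exactly. The construction below (every combination of a few two-valued quantities, so that all the errors are exactly uncorrelated with one another and with the true values) is a device adopted here for the purpose and is not taken from the text; the names are hypothetical.

```python
from itertools import product
from math import sqrt

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Full factorial design: true x, a part u of y independent of x, and four errors.
rows = []
for x, u, d1, d2, e1, e2 in product((-2, 2), (-1, 1), (-1, 1),
                                    (-1, 1), (-1, 1), (-1, 1)):
    y = x + 2 * u                       # true y, correlated with true x
    rows.append((x, y, x + d1, x + d2, y + e1, y + e2))
x, y, x1, x2, y1, y2 = (list(col) for col in zip(*rows))

true_r = correlation(x, y)
observed_r = correlation(x1, y1)        # lowered by the errors, as in section 6

# Equation (5): all means are zero here, so the values are already deviations.
true_var_x = sum(a * b for a, b in zip(x1, x2)) / len(x1)

# Second part of equation (7) and equation (8).
corrected_7 = sqrt(correlation(x1, y2) * correlation(x2, y1)
                   / (correlation(x1, x2) * correlation(y1, y2)))
corrected_8 = ((correlation(x1, y1) * correlation(x2, y2)
                * correlation(x1, y2) * correlation(x2, y1)) ** 0.25
               / sqrt(correlation(x1, x2) * correlation(y1, y2)))
print(true_r, observed_r, corrected_7, corrected_8, true_var_x)
```

In practice the assumptions are never exactly fulfilled, and the corrected value is itself subject to errors of sampling, so the exact recovery seen here should not be expected of real observations.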

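8. Mean and Standard-deviation of an Index. (Ref. 9.) The
means and standard-deviations of non-linear functions of two or
more variables can in general only be expressed in terms of the
means and standard-deviations of the original variables to a first
approximation, on the assumption that deviations are small
compared with the mean values of the variables. Thus let it be
required to find the mean and standard-deviation of a ratio or
index $Z = X_1/X_2$, in terms of the constants for $X_1$ and $X_2$. Let I
be the mean of Z, $M_1$ and $M_2$ the means of $X_1$ and $X_2$. Then

$$I = \frac{1}{N}\Sigma\left(\frac{X_1}{X_2}\right) = \frac{1}{N}\frac{M_1}{M_2}\Sigma\left(1 + \frac{x_1}{M_1}\right)\left(1 + \frac{x_2}{M_2}\right)^{-1}.$$

Expand the second bracket by the binomial theorem, assuming
that $x_2/M_2$ is so small that powers higher than the second can
be neglected. Then to this approximation

$$I = \frac{1}{N}\frac{M_1}{M_2}\Sigma\left(1 + \frac{x_1}{M_1}\right)\left(1 - \frac{x_2}{M_2} + \frac{x_2^2}{M_2^2}\right) = \frac{M_1}{M_2}\left(1 - \frac{\Sigma(x_1x_2)}{N M_1 M_2} + \frac{\Sigma(x_2^2)}{N M_2^2}\right).$$

That is, if r be the correlation between $x_1$ and $x_2$, and if $v_1 = \sigma_1/M_1$,
$v_2 = \sigma_2/M_2$,

$$I = \frac{M_1}{M_2}\left(1 - r v_1 v_2 + v_2^2\right) \qquad (9)$$

If s be the standard-deviation of Z we have

$$s^2 + I^2 = \frac{1}{N}\Sigma\left(\frac{X_1}{X_2}\right)^2 = \frac{1}{N}\frac{M_1^2}{M_2^2}\Sigma\left(1 + \frac{x_1}{M_1}\right)^2\left(1 + \frac{x_2}{M_2}\right)^{-2}.$$

Expanding the second bracket again by the binomial theorem,
and neglecting terms of all orders above the second,

$$s^2 + I^2 = \frac{M_1^2}{M_2^2}\left(1 + v_1^2 - 4 r v_1 v_2 + 3 v_2^2\right),$$

or from (9)

$$s^2 = \frac{M_1^2}{M_2^2}\left(v_1^2 - 2 r v_1 v_2 + v_2^2\right) \qquad (10)$$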

9. Correlation between Indices. (Ref. 9.) The following problem
affords a further illustration of the use of the same method.
Required to find approximately the correlation between two ratios
$Z_1 = X_1/X_3$, $Z_2 = X_2/X_3$, $X_1$, $X_2$ and $X_3$ being uncorrelated.

Let the means of the two ratios or indices be $I_1$, $I_2$ and the
standard-deviations $s_1$, $s_2$; these are given approximately by (9)
and (10) of the last section. The required correlation $\rho$ will be
given by

$$N\rho\,s_1 s_2 = \Sigma\left(\frac{X_1}{X_3} - I_1\right)\left(\frac{X_2}{X_3} - I_2\right) = \Sigma\left(\frac{X_1X_2}{X_3^2}\right) - N I_1 I_2$$

$$\phantom{N\rho\,s_1 s_2} = \frac{M_1M_2}{M_3^2}\Sigma\left(1 + \frac{x_1}{M_1}\right)\left(1 + \frac{x_2}{M_2}\right)\left(1 + \frac{x_3}{M_3}\right)^{-2} - N I_1 I_2.$$

Neglecting terms of higher order than the second as before and
remembering that all correlations are zero, we have

$$N\rho\,s_1 s_2 = N\frac{M_1M_2}{M_3^2}\left(1 + 3v_3^2\right) - N I_1 I_2 = N\frac{M_1M_2}{M_3^2}\,v_3^2,$$

where, in the last step, a term of the order $v_3^4$ has again been
neglected. Substituting from (10) for $s_1$ and $s_2$, we have finally

$$\rho = \frac{v_3^2}{\sqrt{(v_1^2 + v_3^2)(v_2^2 + v_3^2)}} \qquad (11)$$

This value of $\rho$ is obviously positive, being equal to 0.5 if
$v_1 = v_2 = v_3$; and hence even if $X_1$ and $X_2$ are independent, the
indices formed by taking their ratios to a common denominator $X_3$ will
be correlated. The value of $\rho$ is termed by Professor Pearson the
"spurious correlation." Thus if measurements be taken, say, on
three bones of the human skeleton, and the measurements grouped
in threes absolutely at random, there will, nevertheless, be a
positive correlation, probably approaching 0.5, between the indices
formed by the ratios of two of the measurements to the third. To
give another illustration, if two individuals both observe the same
series of magnitudes quite independently, there may be little, if
any, correlation between their absolute errors. But if the errors
be expressed as percentages of the magnitude observed, there
may be considerable correlation. It does not follow of necessity
that the correlations between indices or ratios are misleading.
If the indices are uncorrelated, there will be a similar "spurious"
correlation between the absolute measurements $Z_1 X_3 = X_1$ and
$Z_2 X_3 = X_2$, and the answer to the question whether the correlation
between indices or that between absolute measures is misleading
depends on the further question whether the indices or the
absolute measures are the quantities directly determined by the
causes under investigation (cf. ref. 11).

The case considered, where $X_1$, $X_2$ and $X_3$ are uncorrelated, is only
a special one; for the general discussion cf. ref. 9.
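The "spurious correlation" may be exhibited by an artificial experiment. In the sketch below three independent series are drawn from a pseudo-random generator with a fixed seed; the means (100), the standard-deviations (5), the number of trials and the normal form of distribution are all assumptions made for illustration, and the function names are hypothetical.

```python
import random
from math import sqrt

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spurious_rho(v1, v2, v3):
    """Equation (11), the v's being the ratios sigma / M of the three variables."""
    return v3 ** 2 / sqrt((v1 ** 2 + v3 ** 2) * (v2 ** 2 + v3 ** 2))

rng = random.Random(0)          # fixed seed so that the run can be repeated
n = 20000
x1 = [rng.gauss(100, 5) for _ in range(n)]
x2 = [rng.gauss(100, 5) for _ in range(n)]
x3 = [rng.gauss(100, 5) for _ in range(n)]

z1 = [a / c for a, c in zip(x1, x3)]    # indices with a common denominator
z2 = [b / c for b, c in zip(x2, x3)]

r_absolute = correlation(x1, x2)        # should be near zero
r_indices = correlation(z1, z2)         # should be near the value given by (11)
predicted = spurious_rho(0.05, 0.05, 0.05)
print(r_absolute, r_indices, predicted)
```

The absolute measures show no sensible correlation, while the indices show a correlation close to the 0.5 given by equation (11); the small residual difference is due to sampling and to the terms neglected in the approximation.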

10. The Correlation-coefficient for a two- x two-fold Table. The
correlation-coefficient is in general only calculated for a table with
a considerable number of rows and columns, such as those given
in Chapter IX. In some cases, however, a theoretical value is
obtainable for the coefficient, which holds good even for the limiting
case when there are only two rows and two columns. It is
consequently of some interest to obtain the value for such a
coefficient in terms of the class-frequencies.

Using the notation of Chapters I.-IV. the table is

    (AB)    (Aβ)    (A)
    (αB)    (αβ)    (α)
    (B)     (β)      N

Taking the centre of the table as arbitrary origin and the class-
interval, as usual, as the unit, the co-ordinates of the mean are

$$\xi = \frac{1}{2N}\{(A) - (\alpha)\}, \qquad \eta = \frac{1}{2N}\{(B) - (\beta)\}.$$

The standard-deviations $\sigma_x$, $\sigma_y$ are given by

$$\sigma_x^2 = 0.25 - \xi^2 = (A)(\alpha)/N^2,$$
$$\sigma_y^2 = 0.25 - \eta^2 = (B)(\beta)/N^2.$$

Finally,

$$\Sigma(xy) = \tfrac{1}{4}\{(AB) + (\alpha\beta) - (A\beta) - (\alpha B)\} - N\xi\eta.$$

Writing

$$(AB) - (A)(B)/N = \delta$$

(as in Chap. III. §§ 11-12) and replacing $\xi$, $\eta$ by their values,
this reduces to

$$\Sigma(xy) = \delta.$$

Whence

$$r = \frac{N\delta}{\sqrt{(A)(\alpha)(B)(\beta)}} \qquad (12)$$

This value of r might be used as a coefficient of association, but,
unlike the association-coefficient of Chap. III. § 13, which is
unity if either (AB) = (A) or (AB) = (B), r only becomes unity if
(AB) = (A) = (B). This is the only case in which both frequencies
(Aβ) and (αB) can vanish so that (AB) and (αβ) correspond to
the frequencies of two points $X_1Y_1$, $X_2Y_2$ on a line.
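Equation (12) is simple enough to put directly into code. The class-frequencies used below are invented, and the function name and argument names are hypothetical.

```python
from math import sqrt

def fourfold_r(ab, a_beta, alpha_b, alpha_beta):
    """Correlation-coefficient of a two- x two-fold table by equation (12)."""
    n = ab + a_beta + alpha_b + alpha_beta
    a, alpha = ab + a_beta, alpha_b + alpha_beta      # totals (A), (alpha)
    b, beta = ab + alpha_b, a_beta + alpha_beta       # totals (B), (beta)
    delta = ab - a * b / n
    return n * delta / sqrt(a * alpha * b * beta)

r_general = fourfold_r(30, 10, 20, 40)    # invented frequencies
r_complete = fourfold_r(50, 0, 0, 50)     # (AB) = (A) = (B): r is unity
r_one_sided = fourfold_r(30, 0, 20, 50)   # (AB) = (A) only: r falls short of unity
print(r_general, r_complete, r_one_sided)
```

The last case shows the contrast drawn in the text: a table in which all A's are B's, but not all B's are A's, gives complete association by the coefficient of Chap. III. but a value of r less than unity.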

11. The Correlation-coefficient for all possible pairs of N values
of a Variable. In certain cases a correlation-table is formed by
combining N observations in pairs in all possible ways. If, for
example, a table is being formed to illustrate, say, the correlation
between brothers for stature, and there are three brothers in
one family with statures 5 ft. 9, 5 ft. 10, and 5 ft. 11, these are
regarded as giving the six pairs

    5 ft. 9  with 5 ft. 10        5 ft. 10 with 5 ft. 9
    5 ft. 9  with 5 ft. 11        5 ft. 11 with 5 ft. 9
    5 ft. 10 with 5 ft. 11        5 ft. 11 with 5 ft. 10

which may be entered into the table. The entire table will be
formed from the aggregate of such subsidiary tables, each due to
one family. Let it be required to find the correlation-coefficient,
however, for a single subsidiary table, due to a family with N
members, the number of pairs being therefore N(N - 1).

As each observed value of the variable occurs N - 1 times,
i.e. once in combination with every other value, the means and
standard-deviations of the totals of the correlation-table are the
same as for the original N observations, say M and $\sigma$. If $x_1$, $x_2$,
$x_3$, .... be the observed deviations, the product sum may be
written

$$x_1\{\Sigma(x) - x_1\} + x_2\{\Sigma(x) - x_2\} + x_3\{\Sigma(x) - x_3\} + \dots = -x_1^2 - x_2^2 - x_3^2 - \dots = -N\sigma^2,$$

whence, there being N(N - 1) pairs,

$$r = -\frac{N\sigma^2}{N(N-1)\sigma^2} = -\frac{1}{N-1} \qquad (13)$$

For N = 2, 3, 4 ...., this gives the successive values of r = -1,
-1/2, -1/3 .... It is clear that the first value is right, for two
values $x_1$, $x_2$ only determine the two points ($x_1$, $x_2$) and ($x_2$, $x_1$),
and the slope of the line joining them is negative.

The student should notice that a corresponding negative
association will arise between the first and second member of the
pair if all possible pairs are formed in a mixture of A's and α's.
Looking at the association, in fact, from the standpoint of § 10,
the equation (13) still holds, even if the variables can only assume
two values, e.g. 0 and 1. This result is utilised in § 14 of Chapter
XIV.
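Equation (13) may be confirmed by forming the pairs explicitly. The sketch below uses the three statures of the text, expressed in inches, and a second invented set of 0's and 1's; the function names are hypothetical.

```python
from itertools import permutations
from math import sqrt

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def all_pairs_r(values):
    """Correlation in the table of all N(N - 1) ordered pairs of distinct members."""
    pairs = list(permutations(values, 2))
    return correlation([p[0] for p in pairs], [p[1] for p in pairs])

r_brothers = all_pairs_r([69, 70, 71])      # 5 ft. 9, 5 ft. 10, 5 ft. 11 in inches
r_attribute = all_pairs_r([0, 1, 1, 0, 1])  # a variable taking only two values
print(r_brothers, r_attribute)
```

The two results are -1/2 and -1/4, that is -1/(N - 1) for N = 3 and N = 5, whatever the particular values of the variable.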

12. Correlation due to Heterogeneity of Material. The following
theorem offers some analogy with the theorem of Chap. IV.
§ 6 for attributes. If X and Y are uncorrelated in each of two
records, they will nevertheless exhibit some correlation when the
two records are mingled, unless the mean value of X in the
second record is identical with that in the first record, or the mean
value of Y in the second record is identical with that in the first
record, or both.

This follows almost at once, for if $M_1$, $M_2$ are the mean values of
X in the two records, $K_1$, $K_2$ the mean values of Y, $N_1$, $N_2$ the
numbers of observations, and M, K the means when the two
records are mingled, the product-sum of deviations about M, K is

$$N_1(M_1 - M)(K_1 - K) + N_2(M_2 - M)(K_2 - K).$$

Evidently the first term can only be zero if $M = M_1$ or $K = K_1$.
But the first condition gives

$$\frac{N_1M_1 + N_2M_2}{N_1 + N_2} = M_1,$$

that is, $M_2 = M_1$.

Similarly, the second condition gives $K_2 = K_1$. Both the first
and second terms can, therefore, only vanish if $M_2 = M_1$ or
$K_2 = K_1$. Correlation may accordingly be created by the mingling
of two records in which X and Y vary round different means.
(For a more general form of the theorem cf. ref. 17.)
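A minimal numerical illustration of the theorem follows. The two records are invented: in each, X and Y take every combination of two values, so that they are exactly uncorrelated, but the means of the second record are higher than those of the first in both variables.

```python
from itertools import product
from math import sqrt

def correlation(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

record_1 = list(product((0, 2), (0, 2)))        # means M1 = 1, K1 = 1
record_2 = list(product((10, 12), (10, 12)))    # means M2 = 11, K2 = 11
mingled = record_1 + record_2                   # means M = 6, K = 6

r_1 = correlation(record_1)
r_2 = correlation(record_2)
r_mingled = correlation(mingled)

# Product-sum of the mingled record, from the expression in the text.
product_sum = 4 * (1 - 6) * (1 - 6) + 4 * (11 - 6) * (11 - 6)
print(r_1, r_2, r_mingled, product_sum)
```

Neither record shows any correlation by itself, yet the mingled record gives a coefficient above 0.9, wholly created by the difference of means.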

13. Reduction of Correlation due to mingling of uncorrelated
with correlated pairs. Suppose that $n_1$ observations of x and y
give a correlation-coefficient

$$r_1 = \frac{\Sigma(xy)}{n_1\sigma_x\sigma_y}.$$

Now let $n_2$ pairs be added to the material, the means and
standard-deviations of x and y being the same as in the first
series of observations, but the correlation zero. The value of
$\Sigma(xy)$ will then be unaltered, and we will have

$$r_2 = \frac{\Sigma(xy)}{(n_1 + n_2)\sigma_x\sigma_y} = \frac{n_1}{n_1 + n_2}\,r_1 \qquad (14)$$

Suppose, for example, that a number of bones of the human
skeleton have been disinterred during some excavations, and
a correlation $r_2$ is observed between pairs of bones presumed
to come from the same skeleton, this correlation being rather
lower than might have been expected, and subject to some
uncertainty owing to doubts as to the allocation of certain
bones. If $r_1$ is the value that would be expected from other
records, the difference might be accounted for on the hypothesis
that, in a proportion $(r_1 - r_2)/r_1$ of all the pairs, the bones do
not really belong to the same skeleton, and have been virtually
paired at random. (For a more general form of the theorem cf.
again ref. 17.)
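Equation (14) can be checked on a toy example. The pairs below are invented so that both sets have mean zero and unit standard-deviation in each variable, the first set being perfectly correlated and the second exactly uncorrelated.

```python
from math import sqrt

def correlation(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

correlated = [(-1, -1), (1, 1)]                        # n1 = 2, r1 = 1
uncorrelated = [(-1, 1), (1, -1), (-1, -1), (1, 1)]    # n2 = 4, r = 0

n1, n2 = len(correlated), len(uncorrelated)
r1 = correlation(correlated)
r2 = correlation(correlated + uncorrelated)

predicted_r2 = n1 / (n1 + n2) * r1          # equation (14)
share_random = (r1 - r2) / r1               # proportion paired at random
print(r2, predicted_r2, share_random, n2 / (n1 + n2))
```

The proportion $(r_1 - r_2)/r_1$ recovers exactly the share of uncorrelated pairs in the mingled material, which is the use made of the formula in the illustration of the bones.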

14. Th* Weighted Mean.— The arithmetic mean ^ of a serieB 
of values ot a variable X was defined as the quotient of the sum 
of these values hy their number IC, or 

If, on the other hand, We multiply each several observed 
value ot X by some numerical coefficient or weight W, the 
quotient of the sum of such products by the sum ot the weights 
is defined as a weighted mean, of X, and may be denoted by JtT ; 
80 that 

ir=2(r.z)/s(H0. 

The distinction between " weighted " and " unweighted " means 
is, it should be noted, very often formal rather than essential, 
for the " weights " may be regarded as actual, estimated, or 
virtual frequencies. The weighted mean then becomes simply
an arithmetic mean, in which some new quantity is regarded
as the unit. Thus if we are given the means $M_1, M_2, M_3, \ldots, M_r$
of $r$ series of observations, but do not know the number
of observations in every series, we may form a general average
by taking the arithmetic mean of all the means, viz. $\Sigma(M)/r$,
treating the series as the unit. But if we know the number
of observations in every series it will be better to form the
weighted mean $\Sigma(NM)/\Sigma(N)$, weighting each mean in proportion
to the number of observations in the series on which it is based.
The second form of average would be quite correctly spoken
of as a weighted mean of the means of the several series: at
the same time it is simply the arithmetic mean of all the
series pooled together, i.e. the arithmetic mean obtained by
treating the observation and not the series as the unit.
(Chap. VII. § 13.)
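
The identity between the weighted mean of the means and the mean of the pooled observations may be checked numerically. The following is only a sketch in Python; the three series are hypothetical figures chosen for illustration and do not come from the text.

```python
# Hypothetical data: three series of observations of unequal length.
series = [
    [4.0, 6.0],
    [10.0, 12.0, 14.0, 16.0],
    [1.0, 2.0, 3.0],
]

means = [sum(s) / len(s) for s in series]   # M_1, M_2, ..., M_r
counts = [len(s) for s in series]           # N_1, N_2, ..., N_r

# Unweighted mean of the means: the series is treated as the unit.
mean_of_means = sum(means) / len(means)

# Weighted mean of the means: each mean weighted by its number of observations.
weighted_mean = sum(n * m for n, m in zip(counts, means)) / sum(counts)

# Arithmetic mean of all observations pooled: the observation is the unit.
pooled = [x for s in series for x in s]
pooled_mean = sum(pooled) / len(pooled)

print(mean_of_means, weighted_mean, pooled_mean)
```

The second and third figures printed agree, while the first differs from them because the short series count for as much as the long one.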

15. To give an arithmetical illustration, if a commodity is sold
at different prices in different markets, it will be better to form
an average price, not by taking the arithmetic mean of the several
market prices, treating the market as the unit, but by weighting
each price in proportion to the quantity sold at that price, if
known, i.e. treating the unit of quantity as the unit of frequency.
Thus if wheat has been sold in market A at an average price of
29s. 1d. per quarter, in market B at an average price of 27s. 7d.,
and in market C at an average price of 28s. 4d., we may, if no
statement is made as to the quantities sold at these prices (as very
often happens in the case of statements as to market prices), take
the arithmetic mean (28s. 4d.) as the general average. But if we
know that 23,930 qrs. were sold at A, only 26 qrs. at B, and 3933
qrs. at C, it will be better to take the weighted mean

$$\frac{(29\text{s. }1\text{d.} \times 23{,}930) + (27\text{s. }7\text{d.} \times 26) + (28\text{s. }4\text{d.} \times 3933)}{27{,}889} = 29\text{s. }0\text{d.}$$

to the nearest penny. This is appreciably higher than the
arithmetic mean price, which is lowered by the undue importance
attached to the small markets B and C.
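
The arithmetic of this illustration is easily verified. The sketch below uses the prices and quantities quoted in the text; the helper names `to_pence` and `as_shillings` are illustrative only, and prices are converted to pence (12 pence to the shilling) before averaging.

```python
def to_pence(shillings, pence):
    """Convert a price in shillings and pence to pence (12d. = 1s.)."""
    return 12 * shillings + pence

def as_shillings(pence):
    """Round to the nearest penny and return (shillings, pence)."""
    return divmod(round(pence), 12)

# (price in pence per quarter, quarters sold) for markets A, B and C.
markets = {
    "A": (to_pence(29, 1), 23930),
    "B": (to_pence(27, 7), 26),
    "C": (to_pence(28, 4), 3933),
}

prices = [price for price, _ in markets.values()]
quantities = [qty for _, qty in markets.values()]

# The market as the unit.
simple_mean = sum(prices) / len(prices)
# The quarter of wheat as the unit.
weighted_mean = sum(p * q for p, q in zip(prices, quantities)) / sum(quantities)

print(as_shillings(simple_mean))    # (28, 4)
print(as_shillings(weighted_mean))  # (29, 0)
```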

In the case of index-numbers for exhibiting the changes in
average prices from year to year (cf. Chap. VII. § 25), it may
make a sensible difference whether we take the simple arithmetic
mean of the index-numbers for different commodities in any one
year as representing the price-level in that year, or weight the
index-numbers for the several commodities according to their
importance from some point of view; and much has been written
as to the weights to be chosen. If, for example, our standpoint
be that of some average consumer, we may take as the weight for
each commodity the sum which he spends on that commodity in
an average year, so that the frequency of each commodity is
taken as the number of shillings or pounds spent thereon instead
of simply as unity.

Rates or ratios like the birth-, death-, or marriage-rates of a
country may be regarded as weighted means. For, treating the
rate for simplicity as a fraction, and not as a rate per 1000 of the
population,

$$\text{Birth-rate of whole country} = \frac{\text{total births}}{\text{total population}} = \frac{\Sigma(\text{birth-rate in each district} \times \text{population in that district})}{\Sigma(\text{population of each district})}$$

i.e. the rate for the whole country is the mean of the rates in the
different districts, weighting each in proportion to its population.
We use the weighted and unweighted means of such rates as
illustrations in § 17 below.

16. It is evident that any weighted mean will in general differ
from the unweighted mean of the same quantities, and it is
required to find an expression for this difference. If $r$ be the
correlation between weights and variables, $\sigma_w$ and $\sigma_x$ the
standard-deviations, and $\bar{w}$ the mean weight, we have at once

$$\Sigma(W.X) = N(M.\bar{w} + r\sigma_w\sigma_x),$$

whence

$$M' = M + r\,\sigma_x\,\frac{\sigma_w}{\bar{w}} \qquad (15)$$



That is to say, if the weights and variables are positively correlated,
the weighted mean is the greater; if negatively, the less. In some
cases $r$ is very small, and then weighting makes little difference,
but in others the difference is large and important, $r$ having a
sensible value and $\sigma_w/\bar{w}$ a large value.
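
The relation between the weighted and unweighted means may be checked on any set of figures. The following sketch uses hypothetical values of the variable and of the weights (not taken from the text), with standard-deviations computed with divisor $N$ as elsewhere in this work.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Standard-deviation with divisor N."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def correlation(xs, ys):
    """Product-moment correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    p = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return p / (std(xs) * std(ys))

# Hypothetical values X and weights W, chosen only for illustration.
X = [3.0, 5.0, 8.0, 4.0, 10.0]
W = [2.0, 1.0, 4.0, 1.0, 6.0]

M = mean(X)                                           # unweighted mean
weighted = sum(w * x for w, x in zip(W, X)) / sum(W)  # weighted mean M'
# Right-hand side of the relation: M + r * sigma_x * sigma_w / mean weight.
predicted = M + correlation(W, X) * std(X) * std(W) / mean(W)

print(M, weighted, predicted)
```

Here the larger weights fall on the larger values, so that $r$ is positive and the weighted mean is the greater.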

17. The difference between weighted and unweighted means
of death-rates, birth-rates or other rates on the population in
different districts is, for instance, nearly always of importance.
Thus we have the following figures for rates of pauperism
(Jour. Stat. Soc., vol. lix. (1896), p. 349).

               Percentage of the Population in receipt of Relief.

  January 1.   Arithmetic Mean of Rates     England and Wales
               in different Districts.      as a whole.

    1850              6.51                      5.80
    1860              5.20                      4.26
    1870              5.45                      4.77
    1881              3.68                      3.12
    1891              3.29                      2.69



In this case the weighted mean is markedly the less, and the 
correlation between the population of a district and its pauperism 
must therefore be negative, the larger (on the whole urban) dis- 
tricts having the lower percentage in receipt of relief. On the 
other hand, for the decade 1881-90 the average birth-rate for 
England and Wales was 32.34 per thousand, the arithmetic
mean of the rates for the different districts 30.34 only. The
weighted mean was therefore the greater, the birth-rate being 
higher in the more populous (urban) districts, in which there is 
a greater proportion of young married persons. 

For the year 1891 the average population of a Poor-law district
was found to be roughly 45,900 and the standard-deviation $\sigma_w$
56,400 (populations ranging from under 2000 to over half a
million). The standard-deviation $\sigma_x$ of the percentages of the
population in receipt of relief was 1.24. We have therefore,
for the correlation between pauperism and population,

$$r = -\frac{3.29 - 2.69}{1.24} \times \frac{459}{564} = -0.39.$$


For the birth-rate, on the other hand, assuming that $\sigma_w/\bar{w}$
is approximately the same for the decade 1881-90 as in 1891,
we have, $\sigma_x$ being 4.08,

$$r = \frac{32.34 - 30.34}{4.08} \times \frac{459}{564} = +0.40.$$

The closeness of the numerical values of $r$ in the two cases is,
of course, accidental.
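
Both values of $r$ follow from solving the relation of § 16 for the correlation. A brief sketch, using the figures quoted above as they are read here (the function name `implied_correlation` is illustrative only):

```python
def implied_correlation(weighted_mean, unweighted_mean, sigma_x, mean_weight, sigma_w):
    """Solve M' = M + r * sigma_x * sigma_w / mean_weight for r."""
    return (weighted_mean - unweighted_mean) / sigma_x * mean_weight / sigma_w

# Average population of a district and its standard-deviation (1891).
MEAN_POPULATION = 45900
SIGMA_POPULATION = 56400

# Pauperism, 1891: whole country 2.69, arithmetic mean of districts 3.29.
r_pauperism = implied_correlation(2.69, 3.29, 1.24, MEAN_POPULATION, SIGMA_POPULATION)
# Birth-rate, 1881-90: whole country 32.34, arithmetic mean of districts 30.34.
r_births = implied_correlation(32.34, 30.34, 4.08, MEAN_POPULATION, SIGMA_POPULATION)

print(round(r_pauperism, 2), round(r_births, 2))  # -0.39 0.4
```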

18. The principle of weighting finds one very important
application in the treatment of such rates as death-rates, which
are largely affected by the age and sex-composition of the
population. Neglecting, for simplicity, the question of sex, suppose the
numbers of deaths are noted in a certain district for, say, the
age-groups 0-, 10-, 20-, etc., in which the fractions of the whole
population are $p_1, p_2, p_3$, etc., where $\Sigma(p) = 1$. Let the
death-rates for the corresponding age-groups be $d_1, d_2, d_3$, etc. Then
the ordinary or crude death-rate for the district is

$$D = \Sigma(d.p) \qquad (16)$$

For some other district taken as a basis of comparison, perhaps
the country as a whole, the death-rates and fractions of the
population in the several age-groups may be $\delta_1, \delta_2, \delta_3, \ldots$,
$\pi_1, \pi_2, \pi_3, \ldots$, and the crude death-rate

$$\Delta = \Sigma(\delta.\pi) \qquad (17)$$

Now $D$ and $\Delta$ may differ either because the $d$'s and $\delta$'s differ
or because the $p$'s and $\pi$'s differ, or both. It may happen that
really both districts are about equally healthy, and the death-rates
approximately the same for all age-classes, but, owing to a
difference of weighting, the first average may be markedly higher
than the second, or vice versa. If the first district be a rural
district and the second urban, for instance, there will be a larger
proportion of the old in the former, and it may possibly have a
higher crude death-rate than the second, in spite of lower death-rates
in every class. The comparison of crude death-rates is
therefore liable to lead to erroneous conclusions. The difficulty
may be got over by averaging the age-class death-rates in the
district not with the weights $p_1, p_2, p_3, \ldots$ given by its own
population, but with the weights $\pi_1, \pi_2, \pi_3, \ldots$ given by the
population of the standard district. The corrected death-rate for
the district will then be

$$D' = \Sigma(d.\pi) \qquad (18)$$

and $D'$ and $\Delta$ will be comparable as regards age-distribution.
There is obviously no difficulty in taking sex into account as well
as age if necessary. The death-rates must be noted for each sex
separately in every age-class and averaged with a system of
weights based on the standard population. The method is also
of importance for comparing death-rates in different classes of the
population, e.g. those engaged in given occupations, as well as in
different districts, and is used for both these purposes in the
Decennial Supplements to the Reports of the Registrar-General
for England and Wales (ref. 13).
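
The case of a rural district with lower death-rates in every age-class, but a higher crude rate, may be made concrete. The figures in the following sketch are entirely hypothetical (three age-classes only, rates per 1000) and serve merely to show the working of the correction.

```python
def weighted_rate(rates, weights):
    """Sum of rate x weight; the weights are fractions of the population summing to 1."""
    return sum(r * w for r, w in zip(rates, weights))

# Hypothetical age-class death-rates per 1000 (young, middle-aged, old).
d = [4.0, 8.0, 50.0]        # rural district: lower in every class
delta = [5.0, 10.0, 60.0]   # standard (urban) population

# Hypothetical fractions of the population in each age-class.
p = [0.30, 0.40, 0.30]      # district: a larger proportion of the old
pi = [0.40, 0.45, 0.15]     # standard population

D = weighted_rate(d, p)              # crude death-rate of the district, eqn. (16)
Delta = weighted_rate(delta, pi)     # crude death-rate of the standard, eqn. (17)
D_corrected = weighted_rate(d, pi)   # district rates with the standard weights

print(round(D, 1), round(Delta, 1), round(D_corrected, 1))  # 19.4 15.5 12.7
```

On the crude rates the district appears the less healthy (19.4 against 15.5), whereas the corrected rate (12.7) places it below the standard, as its age-class rates require.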

19. Difficulty may arise in practical cases from the fact that
the death-rates $d_1, d_2, d_3, \ldots$ are not known for the districts or
classes which it is desired to compare with the standard population,
but only the crude rates $D$ and the fractional populations
of the age-classes $p_1, p_2, p_3, \ldots$ The difficulty may be partially
obviated (cf. Chap. IV. § 9, pp. 51-3) by forming what may be
termed a potential or standard death-rate $\Delta'$ for the class or
district, $\Delta'$ being given by

$$\Delta' = \Sigma(\delta.p) \qquad (19)$$

i.e. the rates of the standard population averaged with the
weights of the district population. It is the crude death-rate
that there would be in the district if the rate in every age-class
were the same as in the standard population. An
approximate corrected death-rate for the district or class is
then given by

$$D'' = D \times \frac{\Delta}{\Delta'} \qquad (20)$$

$D''$ is not necessarily, nor generally, the same as $D'$. It can
only be the same if

$$\frac{\Sigma(d.\pi)}{\Sigma(d.p)} = \frac{\Sigma(\delta.\pi)}{\Sigma(\delta.p)}.$$

This will hold good if, e.g., the death-rates in the standard
population and the district stand to one another in the same
ratio in all age-classes, i.e. $\delta_1/d_1 = \delta_2/d_2 = \delta_3/d_3 =$ etc. This method
of correction is used in the Annual Summaries of the Registrar-General
for England and Wales.
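
The two methods of correction may be compared on hypothetical figures. In the sketch below the age-class rates of the district are supposed known, solely in order that the approximate corrected rate of eqn. (20) may be set beside the rate obtained by the method of § 18; in practice the approximation is wanted precisely when they are not known. All figures and the function name `corrected_rates` are illustrative assumptions.

```python
def weighted_rate(rates, weights):
    return sum(r * w for r, w in zip(rates, weights))

def corrected_rates(d, p, delta, pi):
    """Return (exact, approximate) corrected death-rates for a district.

    exact:       district rates d averaged with the standard weights pi (§ 18)
    approximate: crude rate D scaled by Delta / Delta', eqns. (19) and (20)
    """
    D = weighted_rate(d, p)                     # crude rate of the district
    Delta = weighted_rate(delta, pi)            # crude rate of the standard
    Delta_potential = weighted_rate(delta, p)   # potential death-rate, eqn. (19)
    return weighted_rate(d, pi), D * Delta / Delta_potential

# Hypothetical standard population and district age-distribution.
delta = [5.0, 10.0, 60.0]
pi = [0.40, 0.45, 0.15]
p = [0.30, 0.40, 0.30]

# District rates not in a constant ratio to the standard rates.
exact, approx = corrected_rates([4.0, 8.0, 50.0], p, delta, pi)
# District rates everywhere four-fifths of the standard rates.
exact_prop, approx_prop = corrected_rates([4.0, 8.0, 48.0], p, delta, pi)

print(round(exact, 2), round(approx, 2))
print(round(exact_prop, 2), round(approx_prop, 2))
```

In the first case the two corrected rates differ slightly; in the second, where the rates stand in the same ratio in all age-classes, they coincide.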

Both methods of correction, that of § 18 and that of the
present section, are of great and growing importance. They
are obviously applicable to other rates besides death-rates, e.g.
birth-rates (cf. refs. 14, 15). Further, they may readily be
extended into quite different fields. Thus it has been suggested
(ref. 16) that corrected average heights or corrected average weights
of the children in different schools might be obtained on the
basis of a standard school population of given age and sex
composition, or indeed of given composition as regards hair and
eye-colour as well.

20. In §§ 14-17 we have dealt only with the theory of
the weighted arithmetic mean, but it should be noted that
any form of average can be weighted. Thus a weighted median
can be formed by finding the value of the variable such that
the sum of the weights of lesser values is equal to the sum
of the weights of greater values. A weighted mode could be
formed by finding the value of the variable for which the sum
of the weights was greatest, allowing for the smoothing of
casual fluctuations. Similarly, a weighted geometric mean could
be calculated by weighting the logarithms of every value of the
variable before taking the arithmetic mean, i.e.

$$\log G_w = \frac{\Sigma(W.\log X)}{\Sigma(W)}.$$
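
Two of these weighted averages may be sketched as follows. The data are hypothetical. For the median, since with a few discrete values no point may divide the weights exactly in half, the sketch adopts the convention (an assumption of this example, not a rule laid down in the text) of taking the smallest value at which the cumulated weight reaches half the total.

```python
from math import exp, log

def weighted_geometric_mean(values, weights):
    """Antilogarithm of the weighted arithmetic mean of the logarithms."""
    return exp(sum(w * log(x) for x, w in zip(values, weights)) / sum(weights))

def weighted_median(values, weights):
    """Smallest value at which the cumulated weight reaches half the total weight."""
    half = sum(weights) / 2
    running = 0.0
    for x, w in sorted(zip(values, weights)):
        running += w
        if running >= half:
            return x

# Hypothetical values, with a heavy weight on the largest.
values = [1.0, 2.0, 4.0, 8.0]
weights = [1.0, 1.0, 1.0, 5.0]

G_w = weighted_geometric_mean(values, weights)
Mi_w = weighted_median(values, weights)

print(G_w, Mi_w)
```

With unit weights the same functions return the ordinary geometric mean and, under the stated convention, the lower of the two central values.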



REFERENCES.

Effect of Grouping Observations.

(1) SHEPPARD, W. F., "On the Calculation of the Average Square, Cube, etc.,
of a large number of Magnitudes," Jour. Roy. Stat. Soc., vol. lx., 1897,
p. 698.

(2) SHEPPARD, W. F., "On the Calculation of the most probable Values of
Frequency Constants for Data arranged according to Equidistant
Divisions of a Scale," Proc. Lond. Math. Soc., vol. xxix., p. 353. (The
result given in eqn. (3) for the correction of the standard-deviation is
Sheppard's result, but the mode in which he deduces this and similar
corrections is quite different.)

(3) SHEPPARD, W. F., "The Calculation of Moments of a Frequency-distribution,"
Biometrika, v., 1907, p. 450.

(4) PEARSON, KARL, and others [editorial], "On an Elementary Proof of
Sheppard's Formulae for correcting Raw Moments, and on other allied
points," Biometrika, vol. iii., 1904, p. 308.

Effect of Errors of Observation on the Correlation-coefficient.

(5) SPEARMAN, C., "The Proof and Measurement of Association between Two
Things," Amer. Jour. of Psychology, vol. xv., 1904, p. 88.
(Formula (8).)

(6) SPEARMAN, C., "Demonstration of Formulae for True Measurement of
Correlation," Amer. Jour. of Psychology, vol. xviii., 1907, p. 161.
(Proof of formula (8), but on different lines to that given in the text,
which was communicated to Spearman in 1908, and published by
Brown and by Spearman in (7) and (8).)

(7) SPEARMAN, C., "Correlation calculated from Faulty Data," British Jour.
of Psychology, vol. iii., 1910, p. 271.




Correlations between Indices, etc.

(9) PEARSON, KARL, "On a Form of Spurious Correlation which may arise
when Indices are used in the Measurement of Organs," Proc. Roy. Soc.,
vol. lx., 1897, p. 489. (§§ 8, 9.)

(10) GALTON, FRANCIS, "Note to the Memoir by Prof. Karl Pearson on
Spurious Correlation," ibid., p. 498.

(11) YULE, G. U., "On the Interpretation of Correlations between Indices or
Ratios," Jour. Roy. Stat. Soc., vol. lxxiii., 1910, p. 644.



The Weighted Mean. 



Correction of Death-rates, etc.

(13) TATHAM, JOHN, Supplement to the Fifty-fifth Annual Report of the
Registrar-General for England and Wales: Introductory Letters to
Pt. I. and Pt. II. Also Supplement to Sixty-fifth Report: Introductory
Letter to Pt. II. (Cd. 7769, 1895; 8503, 1897; 2619, 1908).

(14) NEWSHOLME, A., and T. H. C. STEVENSON, "The Decline of Human
Fertility in the United Kingdom and other Countries, as shown by
Corrected Birth-rates," Jour. Roy. Stat. Soc., vol. lxix., 1906, p. 34.

(15) YULE, G. U., "On the Changes in the Marriage- and Birth-Rates in
England and Wales during the past half-century," etc., ibid., p. 88.

(16) HERON, DAVID, "The Influence of Defective Physique and Unfavourable
Home Environment on the Intelligence of School Children," Eugenics
Laboratory Memoirs, viii., Dulau & Co., London, 1910.

Miscellaneous.

(17) PEARSON, KARL, ALICE LEE, and L. BRAMLEY-MOORE, "Genetic
(reproductive) Selection: Inheritance of Fertility in Man and of
Fecundity in Thoroughbred Race-horses," Phil. Trans. Roy. Soc.,
Series A, vol. cxcii., 1899, p. 257.

(A number of theorems of general application are given in the
introductory part of this memoir, some of which have been utilised in
§§ 12-13 of the preceding chapter.)

EXERCISES.

1. Find the values obtained for the standard-deviations in Examples ii.
(p. 139) and iii. (p. 141) of Chapter VIII. on applying Sheppard's correction
for grouping.

2. Show that if a range of six times the standard-deviation covers at least
18 class-intervals (cf. Chap. VI. § 5), Sheppard's correction will make a
difference of less than 0.5 per cent. in the rough value of the standard-deviation.

3. (Data from the decennial supplements to the Annual Reports of the
Registrar-General for England and Wales.) The following particulars are
.... in which the number of births in a ....

                      Proportion of Male Births
                      per 1000 of all Births.
   Decade.
                      Mean.      Standard-deviation.

   1881-1890          508.1           12.80
   1891-1900          508.4           10.37
   Both decades       508.25          11.65

It is believed, however, that a great part of the observed standard-deviation
is due to mere "fluctuations of sampling" of no real significance.

Given that the correlation between the proportions of male births in a
district in the two decades is +0.36, estimate (1) the true standard-deviation
freed from such fluctuations of sampling; (2) the standard-deviation of
fluctuations of sampling, i.e. of the errors produced by such fluctuations in the
observed proportions of male births.

4. (Data from Pearson, ref. 9.) The coefficients of variation for breadth,
height, and length of certain skulls are 3'S9, 3.60, and 3.24 per cent. respectively.
Find the "spurious correlation" between the breadth/length and
height/length indices, absolute measures being combined at random so that
they are uncorrelated.

5. (Data from Boas, communicated to Pearson: cf. Fawcett and Pearson,
Proc. Roy. Soc., vol. lxii., p. 413.) From short series of measurements on
American Indians the mean coefficient of correlation found between father and
son, and father and daughter, for cephalic index, is 0.14; between mother and
son, and mother and daughter, 0'3S. Assuming these coefficients should be
the same if it were not for the looseness of family relations, find the proportion
of children not due to the reputed father.

6. Find the correlation between $X_1 + X_2$ and $X_2 + X_3$; $X_1$, $X_2$ and $X_3$ being
uncorrelated.

7. Find the correlation between $X_1$ and $aX_1 + bX_2$, $X_1$ and $X_2$ being
uncorrelated.

8. (Referring to illustration iv., § 14, Chap. X.) Use the answer to
question 7 to estimate, very roughly, the correlation that would be found
between annual movements in infantile and general mortality if the mortality
of those under and over 1 year of age were uncorrelated. Note that

$$\text{general mortality per 1000 of population} = \frac{\text{births}}{\text{population}} \times \text{infantile mortality per 1000 births} + \text{deaths over 1 year per 1000 of population},$$

and treat the ratio of births to population as if it were constant at its
average value, say 0.033. The standard-deviation of annual movements in
infantile mortality is (loc. cit.) 9'S, and that of annual movements in mortality
other than infantile may be taken as sensibly the same as that of general
mortality, or say 1 unit.
9. If the relation

....

holds for all values of $x_1$, $x_2$ and $x_3$ (which are, in our usual notation,
deviations from their respective arithmetic means), find the correlations
between $x_1$, $x_2$ and $x_3$ in terms of their standard-deviations and the values of ....

10. What is the effect on a weighted mean of errors in the weights or the
quantities weighted, such errors being uncorrelated with each other, with the
weights, or with the variables: (1) if the arithmetic mean values of the errors
are zero; (2) if the arithmetic mean values of the errors are not zero?






CHAPTER XII. 

PARTIAL CORRELATION.

1-2. Introductory explanation; 3. Direct deduction of the formulae for two
variables; 4. Special notation for the general case: generalised
regressions; 5. Generalised correlations; 6. Generalised deviations and
standard-deviations; 7-8. Theorems concerning the generalised
product-sums; 9. Direct interpretation of the generalised regressions;
10-11. Reduction of the generalised standard-deviation; 12. Reduction
of the generalised regression; 13. Reduction of the generalised
correlation-coefficient; 14. Arithmetical work: Example i.: Example
ii.; 15. Geometrical representation of correlation between three
variables by means of a model; 16. The coefficient of n-fold correlation;
17. Expression of regressions and correlations of lower in terms of
those of higher order; 18. Limiting inequalities between the values of
correlation-coefficients necessary for consistence; 19. Fallacies.

1. In Chapters IX.-XI. the theory of the correlation-coefficient for
a single pair of variables has been developed and its applications
illustrated. But in the case of statistics of attributes we found
it necessary to proceed from the theory of simple association for
a single pair of attributes to the theory of association for several
attributes, in order to be able to deal with the complex causation
characteristic of statistics; and similarly the student will find it
impossible to advance very far in the discussion of many problems
in correlation without some knowledge of the theory of multiple
correlation, or correlation between several variables. In such a
problem as that of illustration i., Chap. X., for instance, it might
be found that changes in pauperism were highly correlated
(positively) with changes in the out-relief ratio, and also with
changes in the proportion of old; and the question might arise how
far the first correlation was due merely to a tendency to give
out-relief more freely to the old than the young, i.e. to a correlation
between changes in out-relief and changes in proportion of old.
The question could not at the present stage be answered by working
out the correlation-coefficient between the last pair of variables,
for we have as yet no guide as to how far a correlation between
the variables 1 and 2 can be accounted for by correlations
between 1 and 3 and 2 and 3. Again, in the case of illustration iii.,
Chap. X., a marked positive correlation might be observed between,
say, the bulk of a crop and the rainfall during a certain period, and
practically no correlation between the crop and the accumulated
temperature during the same period; and the question might arise
whether the last result might not be due merely to a negative
correlation between rain and accumulated temperature, the crop
being favourably affected by an increase of accumulated temperature
if other things were equal, but failing as a rule to obtain this
benefit owing to the concomitant deficiency of rain. In the problem
of inheritance in a population, the corresponding problem is
of great importance, as already indicated in Chapter IV. It is
essential for the discussion of possible hypotheses to know whether
an observed correlation between, say, grandson and grandparent
can or cannot be accounted for solely by observed correlations
between grandson and parent, parent and grandparent.

2. Problems of this type, in which it is necessary to consider
simultaneously the relations between at least three variables, and
possibly more, may be treated by a simple and natural extension
of the method used in the case of two variables. The latter case
was discussed by forming linear equations between the two
variables, assigning such values to the constants as to make the
sum of the squares of the errors of estimate as low as possible:
the more complicated case may be discussed by forming linear
equations between any one of the $n$ variables involved, taking
each in turn, and the $n - 1$ others, again assigning such values to
the constants as to make the sum of the squares of the errors of
estimate a minimum. If the variables are $X_1, X_2, X_3, \ldots, X_n$,
the equation will be of the form

$$X_1 = a + b_2.X_2 + b_3.X_3 + \ldots + b_n.X_n.$$

If in such a generalised regression or characteristic equation we
find a sensible positive value for any one coefficient such as $b_2$,
we know that there must be a positive correlation between $X_1$
and $X_2$ that cannot be accounted for by mere correlations of $X_1$
and $X_2$ with $X_3$, $X_4$, or $X_n$, for the effects of changes in these
variables are allowed for in the remaining terms on the right.
The magnitude of $b_2$ gives, in fact, the mean change in $X_1$
associated with a unit change in $X_2$ when all the remaining
variables are kept constant. The correlation between $X_1$ and
$X_2$ indicated by $b_2$ may be termed a partial correlation, as
corresponding with the partial association of Chapter IV., and it
is required to deduce from the values of the coefficients $b$, which
may be termed partial regressions, partial coefficients of correlation
giving the correlation between $X_1$ and $X_2$ or other pair of
variables when the remaining variables $X_3, \ldots, X_n$ are kept
constant, or when changes in these variables are corrected or allowed
for, so far as this may be done with a linear equation. For examples
of such generalised regression-equations the student may turn to
the illustrations worked out below (pp. 235 et seq.).
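
The case of the crop, the rainfall and the accumulated temperature described in § 1 may be imitated with a small artificial example. The figures below are hypothetical, constructed so that rain and temperature are negatively correlated and the crop shows no total correlation with temperature; the function `partial_regressions` (a name chosen here for illustration) solves the two least-square equations for the coefficients of rain and of temperature directly.

```python
def deviations(values):
    """Deviations from the arithmetic mean."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def product_sum(a, b):
    return sum(u * v for u, v in zip(a, b))

def partial_regressions(x1, x2, x3):
    """Least-square coefficients of x2 and x3 in x1 = b2*x2 + b3*x3 (deviations)."""
    s22, s33, s23 = product_sum(x2, x2), product_sum(x3, x3), product_sum(x2, x3)
    s12, s13 = product_sum(x1, x2), product_sum(x1, x3)
    det = s22 * s33 - s23 * s23
    return (s12 * s33 - s13 * s23) / det, (s13 * s22 - s12 * s23) / det

# Hypothetical yearly figures (arbitrary units).
crop = [30.0, 31.0, 28.0, 31.0, 30.0]
rain = [11.0, 11.0, 9.0, 10.0, 9.0]
temp = [18.0, 19.0, 20.0, 21.0, 22.0]

x1, x2, x3 = deviations(crop), deviations(rain), deviations(temp)

total_crop_temp = product_sum(x1, x3)   # zero: no total correlation
rain_temp = product_sum(x2, x3)         # negative: hot periods are dry
b_rain, b_temp = partial_regressions(x1, x2, x3)

print(total_crop_temp, rain_temp, b_rain, b_temp)
```

Although the product-sum of crop and temperature is zero, the coefficient of temperature in the regression-equation is positive: the benefit of warmth is masked, in the total correlation, by the accompanying deficiency of rain.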

3. With this explanatory introduction, we may now proceed to
the algebraic theory of such generalised regression-equations and
of multiple correlation in general. It will first, however, be as
well to revert briefly to the case of two variables. In Chapter IX.,
to obtain the greatest possible simplicity of treatment, the value
of the coefficient $r = p/\sigma_1\sigma_2$ was deduced on the special assumption
that the means of all arrays were strictly collinear, and the
meaning of the coefficient in the more general case was subsequently
investigated. Such a process is not conveniently
applicable when a number of variables are to be taken into
account, and the problem has to be faced directly: i.e. required,
to determine the coefficients and constant term, if any, in a
regression-equation, so as to make the sum of the squares of the
errors of estimate a minimum. We will take this problem first
for the case of two variables, introducing a notation that can be
conveniently adapted to more. Let us take the arithmetic
means of the variables as origins of measurement, and let $x_1$, $x_2$
denote deviations of the two variables from their respective
means. Then it is required to determine $a_1$ and $b_{12}$ in the
regression-equation

$$x_1 = a_1 + b_{12}.x_2 \qquad (a)$$

so as to make $\Sigma(x_1 - a_1 - b_{12}.x_2)^2$, for all associated pairs of
deviations $x_1$ and $x_2$, the least possible. Put more briefly, if
we write

$$N.s_{1.2}^2 = \Sigma(x_1 - a_1 - b_{12}.x_2)^2 \qquad (b)$$

so that $s_{1.2}$ is the root-mean-square value of the errors of estimate
in using regression-equation (a) (cf. Chap. IX. § 14), it is required
to make $s_{1.2}$ a minimum. Suppose any value whatever to be
assigned to $b_{12}$, and a series of values of $a_1$ to be tried, $s_{1.2}$ being
calculated for each. Evidently $s_{1.2}$ would be very large for
values of $a_1$ that erred greatly either in excess or defect of the
best value (for the given value of $b_{12}$), and would continuously
decrease as this best value was approached; the value of $s_{1.2}$ could
never become negative, though possibly, but exceptionally, zero.
If therefore the values of $s_{1.2}$ were plotted to the values of $a_1$ on
a diagram, a curve would be obtained more or less like that
of fig. 44. The best value of $a_1$, for which $s_{1.2}$ attained its
minimum value, say $\sigma_{1.2}$, could be approximately estimated from
such a diagram; but it can be calculated with much more exactness
from the condition that if $a_1'$, $a_1''$ be two values close above
and below the best, the corresponding values of $s_{1.2}$ are equal. Let
$a_1$ and $(a_1 + \delta)$ be two such values. Then if

$$\Sigma(x_1 - a_1 - b_{12}.x_2)^2 = \Sigma(x_1 - a_1 - \delta - b_{12}.x_2)^2$$

when $\delta$ is very small, the value of $a_1$ is the best for the assigned
value of $b_{12}$. But, evidently, the equation gives, neglecting
the term in $\delta^2$,

$$\Sigma(x_1 - a_1 - b_{12}.x_2) = 0, \quad \text{that is,} \quad a_1 = 0,$$

whatever the value of $b_{12}$. This is the direct proof of the
result that no constant term need be introduced on the right
of a regression-equation when written in terms of deviations
from the arithmetic mean, or that the two lines of regression
must pass through the mean (Chap. IX. § 10). We may
therefore omit any constant term. If, now, $b_{12}$ is to be assigned
the best value, we must have, by similar reasoning, for slightly
differing values, $b_{12}$, $b_{12} + \delta$,

$$\Sigma(x_1 - b_{12}.x_2)^2 = \Sigma(x_1 - (b_{12} + \delta)x_2)^2.$$

That is, again neglecting terms in $\delta^2$,

$$\Sigma x_2(x_1 - b_{12}.x_2) = 0 \qquad (c)$$

or, breaking up the sum,

$$b_{12} = \frac{\Sigma(x_1 x_2)}{\Sigma(x_2^2)} = r_{12}\frac{\sigma_1}{\sigma_2},$$

which is the value found by the previous indirect method of
Chapter IX. From the fact that $b_{12}$ is determined so as to
make the value of $\Sigma(x_1 - b_{12}.x_2)^2$ the least possible, the method
of determination is sometimes called the method of least squares.
Evidently all the remaining results of Chapter IX. follow from
this, and notably we have for $\sigma_{1.2}$, the minimum value of $s_{1.2}$,
the standard-deviation of errors of estimate,

$$\sigma_{1.2}^2 = \sigma_1^2(1 - r_{12}^2) \qquad (d)$$
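
The results of this section may be illustrated numerically. The sketch below takes a small hypothetical set of paired observations, forms the deviations, and confirms that a zero constant term together with the regression $\Sigma(x_1x_2)/\Sigma(x_2^2)$ gives a smaller sum of squared errors than neighbouring trial values, and that the minimum satisfies eqn. (d). The trial step of 0.1 is an arbitrary choice.

```python
from math import sqrt

# Hypothetical paired observations.
X1 = [2.0, 4.0, 5.0, 4.0, 5.0]
X2 = [1.0, 2.0, 3.0, 4.0, 5.0]
N = len(X1)

def deviations(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

x1, x2 = deviations(X1), deviations(X2)

def sum_sq_errors(a, b):
    """Sum of squared errors of estimate for the equation x1 = a + b * x2."""
    return sum((u - a - b * v) ** 2 for u, v in zip(x1, x2))

# Least-square regression of x1 on x2.
b12 = sum(u * v for u, v in zip(x1, x2)) / sum(v * v for v in x2)

best = sum_sq_errors(0.0, b12)
# Smallest sum of squares among the eight neighbouring trial values.
worse = min(
    sum_sq_errors(a, b12 + db)
    for a in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (a, db) != (0.0, 0.0)
)

sigma1_sq = sum(u * u for u in x1) / N
r12 = sum(u * v for u, v in zip(x1, x2)) / sqrt(
    sum(u * u for u in x1) * sum(v * v for v in x2)
)

print(b12, best / N, sigma1_sq * (1 - r12 ** 2))
```

The last two figures printed are the mean square error of estimate computed directly and the value given by eqn. (d); they agree.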

4. Now apply the same method to the regression-equation
for $n$ variables. Writing the equation in terms of deviations,
it follows from reasoning precisely similar to that given above
that no constant term need be entered on the right-hand
side. For the partial regression-coefficients (the coefficients of
the $x$'s on the right) a special notation will be used in order
that the exact position of each coefficient may be rendered quite
definite. The first subscript affixed to the letter $b$ (which will
always be used to denote a regression) will be the subscript of
the $x$ on the left (the dependent variable), and the second will
be the subscript of the $x$ to which it is attached; these may
be called the primary subscripts. After the primary subscripts,
and separated from them by a point, are placed the subscripts
of all the remaining variables on the right-hand side as secondary
subscripts. The regression-equation will therefore be written
in the form

$$x_1 = b_{12.34\ldots n}.x_2 + b_{13.24\ldots n}.x_3 + \ldots + b_{1n.23\ldots(n-1)}.x_n \qquad (1)$$

The order in which the secondary subscripts are written is,
it should be noted, quite indifferent, but the order of the
primary subscripts is material; e.g. $b_{12.3\ldots n}$ and $b_{21.3\ldots n}$
denote quite distinct coefficients, $x_1$ being the dependent variable
in the first case and $x_2$ in the second. A coefficient with $p$
secondary subscripts may be termed a regression of the $p$th order.
The regressions $b_{12}$, $b_{21}$, $b_{13}$, $b_{31}$, etc., in the case of two variables
may be regarded as of order zero, and may be termed total as
distinct from partial regressions.

5. In the case of two variables, the correlation-coefficient $r_{12}$
may be regarded as defined by the equation

$$r_{12} = (b_{12}.b_{21})^{\frac{1}{2}}.$$

We shall generalise this equation in the form

$$r_{12.34\ldots n} = (b_{12.34\ldots n}.b_{21.34\ldots n})^{\frac{1}{2}} \qquad (2)$$

This is at present a pure definition of a new symbol, and it
remains to be shown that $r_{12.34\ldots n}$ may really be regarded as,
and possesses all the properties of, a correlation-coefficient; the
name may, however, be applied to it, pending the proof. A
correlation-coefficient with $p$ secondary subscripts will be termed
a correlation of order $p$. Evidently, in the case of a
correlation-coefficient, the order in which both primary and secondary
subscripts are written is indifferent, for the right-hand side of
equation (2) is unaltered by writing 2 for 1 and 1 for 2. The
correlations $r_{12}$, $r_{13}$, etc., may be regarded as of order zero, and
spoken of as total, as distinct from partial, correlations.

6. If the regressions $b_{12.34\ldots n}$, $b_{13.24\ldots n}$, etc., be assigned the
"best" values, as determined by the method of least squares, the
difference between the actual value of $x_1$ and the value assigned
by the right-hand side of the regression-equation (1), that is, the
error of estimate, will be denoted by $x_{1.23\ldots n}$; i.e. as a definition
we have

$$x_{1.23\ldots n} = x_1 - b_{12.34\ldots n}.x_2 - b_{13.24\ldots n}.x_3 - \ldots - b_{1n.23\ldots(n-1)}.x_n \qquad (3)$$

where $x_1, x_2, \ldots, x_n$ are assigned any one set of observed values.
Such an error (or residual, as it is sometimes called) denoted by a
symbol with $p$ secondary suffixes, will be termed a deviation of the
$p$th order. Finally, we will define a generalised standard-deviation
$\sigma_{1.23\ldots n}$ by the equation

$$N.\sigma_{1.23\ldots n}^2 = \Sigma(x_{1.23\ldots n}^2) \qquad (4)$$

$N$ being, as usual, the number of observations. A standard-deviation
denoted by a symbol with $p$ secondary suffixes will be
termed a standard-deviation of the $p$th order, the standard-deviations
$\sigma_1$, $\sigma_2$, etc., being regarded as of order zero, the standard-deviations
$\sigma_{1.2}$, $\sigma_{2.1}$, etc. (cf. eqn. (d) of § 3), of the first order, and
so on.

7. From the reasoning of § 3 it follows that the "least-square"
values of the partial regressions $b_{12.34\ldots n}$, etc., will be given by
equations of the form

$$\Sigma(x_1 - b_{12.34\ldots n}.x_2 - \ldots - b_{1n.23\ldots(n-1)}.x_n)^2 = \Sigma(x_1 - (b_{12.34\ldots n} + \delta)x_2 - \ldots - b_{1n.23\ldots(n-1)}.x_n)^2,$$

$\delta$ being very small. That is, neglecting the term in $\delta^2$,

$$\Sigma x_2(x_1 - b_{12.34\ldots n}.x_2 - \ldots - b_{1n.23\ldots(n-1)}.x_n) = 0,$$

or, more briefly, in terms of the notation of equation (3),

$$\Sigma(x_2.x_{1.23\ldots n}) = 0 \qquad (5)$$

There are a large number of these equations, $(n - 1)$ for determining
the coefficients $b_{12.34\ldots n}$, etc., $(n - 1)$ again for determining
the coefficients $b_{21.34\ldots n}$, etc., and so on: they are sometimes
termed the normal equations. If the student will follow the process
by which (5) was obtained, he will see that when the condition
is expressed that $b_{12.34\ldots n}$ shall possess the "least-square"
value, $x_2$ enters into the product-sum with $x_{1.23\ldots n}$; when the
same condition is expressed for $b_{13.24\ldots n}$, $x_3$ enters into the
product-sum, and so on. Taking each regression in turn, in fact,
every $x$ the suffix of which is included in the secondary suffixes
of $x_{1.23\ldots n}$ enters into the product-sum. The normal equations
of the form (5) are therefore equivalent to the theorem:

The product-sum of any deviation of order zero with any deviation
of higher order is zero, provided the subscript of the former occur
among the secondary subscripts of the latter.

8. But it follows from this that

\[ \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n})
   = \Sigma\,x_{1.34\ldots n}(x_2 - b_{23.4\ldots n}x_3 - \ldots - b_{2n.34\ldots(n-1)}x_n)
   = \Sigma(x_{1.34\ldots n}\,x_2). \]

Similarly,

\[ \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) = \Sigma(x_1\,x_{2.34\ldots n}). \]

Similarly again,

\[ \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots(n-1)}) = \Sigma(x_{1.34\ldots n}\,x_2), \]

and so on. Therefore, quite generally,

\[ \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n})
   = \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots(n-1)})
   = \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots(n-2)})
   = \ldots = \Sigma(x_{1.34\ldots n}\,x_2). \]

Comparing all the equal product-sums that may be obtained
in this way, we see that the product-sum of any two deviations is
unaltered by omitting any or all of the secondary subscripts of either
which are common to the two, and, conversely, the product-sum of any
deviation of order p with a deviation of order p + q, the p subscripts
being the same in each case, is unaltered by adding to the secondary
subscripts of the former any or all of the q additional subscripts of
the latter.

It follows therefore from (5) that any product-sum is zero if all
the subscripts of the one deviation occur among the secondary
subscripts of the other. As the simplest case, we may note that $x_1$ is
uncorrelated with $x_{2.1}$, and $x_2$ uncorrelated with $x_{1.2}$.

The theorems of this and of the preceding paragraph are of
fundamental importance, and should be carefully remembered.
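
The theorems of §§ 7 and 8 can be confirmed numerically. The following is a minimal sketch in Python, assuming three made-up series; the generated data, the random seed, and the helper names `centred` and `psum` are illustrative assumptions and not part of the text. The two partial regressions are found by solving the pair of normal equations directly.

```python
import random

def centred(values):
    """Deviations of a series from its arithmetic mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def psum(u, v):
    """Product-sum of two series of deviations."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical data: 50 observations of three correlated variables.
random.seed(1)
x2 = centred([random.gauss(0, 1) for _ in range(50)])
x3 = centred([0.5 * a + random.gauss(0, 1) for a in x2])
x1 = centred([1.2 * a - 0.7 * b + random.gauss(0, 1) for a, b in zip(x2, x3)])

# Normal equations for the regression of x1 on x2 and x3:
#   S(x2 x2) b12.3 + S(x2 x3) b13.2 = S(x1 x2)
#   S(x2 x3) b12.3 + S(x3 x3) b13.2 = S(x1 x3)
s22, s33, s23 = psum(x2, x2), psum(x3, x3), psum(x2, x3)
s12, s13 = psum(x1, x2), psum(x1, x3)
det = s22 * s33 - s23 ** 2
b12_3 = (s12 * s33 - s13 * s23) / det
b13_2 = (s13 * s22 - s12 * s23) / det

# Deviation of the second order x1.23, and of the first order x2.3.
x1_23 = [a - b12_3 * b - b13_2 * c for a, b, c in zip(x1, x2, x3)]
b23 = s23 / s33
x2_3 = [b - b23 * c for b, c in zip(x2, x3)]

# Section 7: x2 and x3 each give a zero product-sum with x1.23.
print(psum(x2, x1_23), psum(x3, x1_23))
# Section 8: omitting the common secondary subscripts changes nothing.
print(psum(x1_23, x1_23), psum(x1, x1_23))
# Both subscripts of x2.3 occur among the secondary subscripts of x1.23.
print(psum(x2_3, x1_23))
```

The printed product-sums that should vanish do so only up to floating-point rounding, so they appear as very small numbers rather than exact zeros.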




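9. We have now from §§ 7 and 8:

\[ 0 = \Sigma(x_{2.34\ldots n}\,x_{1.23\ldots n}) \]
\[ \phantom{0} = \Sigma\,x_{2.34\ldots n}(x_1 - b_{12.34\ldots n}x_2 - \text{terms in } x_3 \text{ to } x_n) \]
\[ \phantom{0} = \Sigma(x_1\,x_{2.34\ldots n}) - b_{12.34\ldots n}\,\Sigma(x_2\,x_{2.34\ldots n}) \]
\[ \phantom{0} = \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) - b_{12.34\ldots n}\,\Sigma(x_{2.34\ldots n}^{2}). \]

That is

\[ b_{12.34\ldots n} = \frac{\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n})}{\Sigma(x_{2.34\ldots n}^{2})}. \tag{6} \]

But this is the value that would have been obtained by taking a
regression-equation of the form

\[ x_{1.34\ldots n} = b_{12.34\ldots n}\,x_{2.34\ldots n} \tag{7} \]

and determining $b_{12.34\ldots n}$ by the method of least squares, i.e.
$b_{12.34\ldots n}$ is the regression of $x_{1.34\ldots n}$ on $x_{2.34\ldots n}$. It follows
at once from (2) that $r_{12.34\ldots n}$ is the correlation between
$x_{1.34\ldots n}$ and $x_{2.34\ldots n}$, and from (4) that we may write

\[ b_{12.34\ldots n} = r_{12.34\ldots n}\,\frac{\sigma_{1.34\ldots n}}{\sigma_{2.34\ldots n}}, \tag{8} \]

an equation identical with the familiar relation $b_{12} = r_{12}\,\sigma_1/\sigma_2$,
with the secondary suffixes 34 .... n added throughout.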

To illustrate the meaning of the equation by the simplest case,
if we had three variables only, $x_1$, $x_2$, and $x_3$, the value of $b_{12.3}$ or
$r_{12.3}$ could be determined (1) by finding the correlations $r_{13}$ and
$r_{23}$ and the corresponding regressions $b_{13}$ and $b_{23}$; (2) working out
the residuals $x_1 - b_{13}x_3$ and $x_2 - b_{23}x_3$ for all associated deviations;
(3) working out the correlation between the residuals associated
with the same values of $x_3$. The method would not, however, be
a practical one, as the arithmetic would be extremely lengthy,
much more lengthy than the method given below for expressing
a correlation of order p in terms of correlations of order p - 1.
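
The three steps just described can be sketched as follows. This is an illustration under stated assumptions only: the data are randomly generated, and the helper names (`centred`, `psum`, `corr`) are hypothetical. The residual route is compared with the closed form for three variables that the chapter derives later as equation (12).

```python
import math
import random

def centred(values):
    """Deviations of a series from its arithmetic mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def psum(u, v):
    """Product-sum of two series of deviations."""
    return sum(a * b for a, b in zip(u, v))

def corr(u, v):
    """Correlation coefficient of two series of deviations."""
    return psum(u, v) / math.sqrt(psum(u, u) * psum(v, v))

# Hypothetical data: 40 observations of three variables.
random.seed(2)
x3 = centred([random.gauss(0, 1) for _ in range(40)])
x2 = centred([0.6 * c + random.gauss(0, 1) for c in x3])
x1 = centred([0.8 * b - 0.4 * c + random.gauss(0, 1) for b, c in zip(x2, x3)])

# Step (1): total regressions of x1 and of x2 on x3.
b13 = psum(x1, x3) / psum(x3, x3)
b23 = psum(x2, x3) / psum(x3, x3)

# Step (2): residuals x1 - b13 x3 and x2 - b23 x3.
x1_3 = [a - b13 * c for a, c in zip(x1, x3)]
x2_3 = [b - b23 * c for b, c in zip(x2, x3)]

# Step (3): correlation between the associated residuals.
r12_3_residuals = corr(x1_3, x2_3)

# The same coefficient from the three correlations of order zero.
r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)
r12_3_formula = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

print(r12_3_residuals, r12_3_formula)
```

The two printed values agree to rounding error, which is why the shorter formula is preferred for arithmetic.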

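10. Any standard-deviation of order p may be expressed in terms
of a standard-deviation of order p - 1 and a correlation of order p - 1.
For,

\[ \Sigma(x_{1.23\ldots n}^{2}) = \Sigma(x_{1.23\ldots(n-1)}\,x_{1.23\ldots n}) \]
\[ \phantom{\Sigma(x^2)} = \Sigma\,x_{1.23\ldots(n-1)}(x_1 - b_{1n.23\ldots(n-1)}x_n - \text{terms in } x_2 \text{ to } x_{n-1}) \]
\[ \phantom{\Sigma(x^2)} = \Sigma(x_{1.23\ldots(n-1)}^{2}) - b_{1n.23\ldots(n-1)}\,\Sigma(x_{1.23\ldots(n-1)}\,x_{n.23\ldots(n-1)}), \]

or, dividing through by the number of observations,

\[ \sigma_{1.23\ldots n}^{2} = \sigma_{1.23\ldots(n-1)}^{2}\,(1 - b_{1n.23\ldots(n-1)}\,b_{n1.23\ldots(n-1)})
   = \sigma_{1.23\ldots(n-1)}^{2}\,(1 - r_{1n.23\ldots(n-1)}^{2}). \tag{9} \]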




This is again the relation of the familiar form

\[ \sigma_{1.n}^{2} = \sigma_1^{2}\,(1 - r_{1n}^{2}) \]

with the secondary suffixes 23 .... (n - 1) added throughout.

It is clear from (9) that $r_{1n.23\ldots(n-1)}$, like any correlation of order
zero, cannot be numerically greater than unity. It also follows
at once that if we have been estimating $x_1$ from $x_2$, $x_3$ .... $x_{n-1}$,
$x_n$ will not increase the accuracy of estimate unless $r_{1n.23\ldots(n-1)}$
(not $r_{1n}$) differ from zero. This condition is somewhat interesting,
as it leads to rather unexpected results. For example, if $r_{12} = +0.8$,
$r_{13} = +0.4$, $r_{23} = +0.5$, it will not be possible to estimate $x_1$ with
any greater accuracy from $x_2$ and $x_3$ than from $x_2$ alone, for the
value of $r_{13.2}$ is zero (see below, § 13).
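
The example just given can be checked in a few lines. This is a small sketch only: the function name `partial_r` and the unit value chosen for the standard-deviation of $x_1$ are assumptions made for illustration, and the formula used for $r_{13.2}$ is the three-variable form obtained in § 13.

```python
import math

def partial_r(r_ab, r_ac, r_bc):
    """First-order partial correlation r_ab.c from three zero-order ones."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

r12, r13, r23 = 0.8, 0.4, 0.5        # values quoted in the text
sigma1 = 1.0                         # assumed unit, for illustration only

r13_2 = partial_r(r13, r12, r23)     # correlation of x1 and x3, x2 given

# Standard error of estimate from x2 alone, then from x2 and x3 together.
sigma1_2 = sigma1 * math.sqrt(1 - r12 ** 2)
sigma1_23 = sigma1_2 * math.sqrt(1 - r13_2 ** 2)

print(r13_2, sigma1_2, sigma1_23)
```

Because $r_{13.2}$ vanishes, the factor $(1 - r_{13.2}^2)$ is unity and the error of estimate is not reduced by bringing in $x_3$, although $r_{13}$ itself is $+0.4$.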

11. It should be noted that, in equation (9), any other subscript
can be eliminated in the same way as subscript n from the suffix of
$\sigma_{1.23\ldots n}$, so that a standard-deviation of order p can be expressed
in p ways in terms of standard-deviations of the next lower order.
This is useful as affording an independent check on arithmetic.
Further, $\sigma_{1.23\ldots(n-1)}$ can be expressed in the same way in terms
of $\sigma_{1.23\ldots(n-2)}$, and so on, so that we must have

\[ \sigma_{1.23\ldots n}^{2} = \sigma_1^{2}\,(1 - r_{12}^{2})(1 - r_{13.2}^{2})(1 - r_{14.23}^{2}) \ldots (1 - r_{1n.23\ldots(n-1)}^{2}). \tag{10} \]

This is an extremely convenient expression for arithmetical use;
the arithmetic can again be subjected to an absolute check by
eliminating the subscripts in a different, say the inverse, order.
Apart from the algebraic proof, it is obvious that the values must
be identical; for if we are estimating one variable from n others, it
is clearly indifferent in what order the latter are taken into account.
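
The check by elimination in a different order can be sketched for three variables as follows. The correlations and the standard-deviation below are hypothetical values chosen only for illustration, and `partial_r` is an assumed helper name.

```python
import math

def partial_r(r_ab, r_ac, r_bc):
    """First-order partial correlation r_ab.c from three zero-order ones."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Hypothetical constants (illustrative only).
r12, r13, r23 = 0.5, 0.3, 0.2
sigma1 = 10.0

# Subscript 2 taken first, then 3:
#   sigma_1.23^2 = sigma_1^2 (1 - r12^2)(1 - r13.2^2)
r13_2 = partial_r(r13, r12, r23)
var_a = sigma1 ** 2 * (1 - r12 ** 2) * (1 - r13_2 ** 2)

# Subscript 3 taken first, then 2:
#   sigma_1.23^2 = sigma_1^2 (1 - r13^2)(1 - r12.3^2)
r12_3 = partial_r(r12, r13, r23)
var_b = sigma1 ** 2 * (1 - r13 ** 2) * (1 - r12_3 ** 2)

print(math.sqrt(var_a), math.sqrt(var_b))
```

Any disagreement between the two results beyond rounding would point to an arithmetical slip in one of the partial correlations.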

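12. Any regression of order p may be expressed in terms of
regressions of order p - 1. For we have

\[ \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) = \Sigma(x_{1.34\ldots(n-1)}\,x_{2.34\ldots n}) \]
\[ \phantom{\Sigma} = \Sigma\,x_{1.34\ldots(n-1)}(x_2 - b_{2n.34\ldots(n-1)}x_n - \text{terms in } x_3 \text{ to } x_{n-1}) \]
\[ \phantom{\Sigma} = \Sigma(x_{1.34\ldots(n-1)}\,x_{2.34\ldots(n-1)}) - b_{2n.34\ldots(n-1)}\,\Sigma(x_{1.34\ldots(n-1)}\,x_{n.34\ldots(n-1)}). \]

Replacing $b_{2n.34\ldots(n-1)}$ by $b_{n2.34\ldots(n-1)}\,\sigma_{2.34\ldots(n-1)}^{2}/\sigma_{n.34\ldots(n-1)}^{2}$,
we have

\[ b_{12.34\ldots n}\,\sigma_{2.34\ldots n}^{2}
   = b_{12.34\ldots(n-1)}\,\sigma_{2.34\ldots(n-1)}^{2}
   - b_{1n.34\ldots(n-1)}\,b_{n2.34\ldots(n-1)}\,\sigma_{2.34\ldots(n-1)}^{2}, \]

or, from (9),

\[ b_{12.34\ldots n} = \frac{b_{12.34\ldots(n-1)} - b_{1n.34\ldots(n-1)}\,b_{n2.34\ldots(n-1)}}
                           {1 - b_{2n.34\ldots(n-1)}\,b_{n2.34\ldots(n-1)}}. \tag{11} \]

The student should note that this is an expression of the form

\[ b_{12.n} = \frac{b_{12} - b_{1n}\,b_{n2}}{1 - b_{2n}\,b_{n2}} \]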




with the subscripts 34 .... (n - 1) added throughout. The
coefficient $b_{12.34\ldots n}$ may therefore be regarded as determined
from a regression-equation of the form

\[ x_{1.34\ldots(n-1)} = b_{12.34\ldots n}\,x_{2.34\ldots(n-1)} + b_{1n.23\ldots(n-1)}\,x_{n.34\ldots(n-1)}, \]

i.e. it is the partial regression of $x_{1.34\ldots(n-1)}$ on $x_{2.34\ldots(n-1)}$,
$x_{n.34\ldots(n-1)}$ being given. As any other secondary suffix might
have been eliminated in lieu of n, we might also regard it as
the partial regression of $x_{1.45\ldots n}$ on $x_{2.45\ldots n}$, $x_{3.45\ldots n}$ being
given, and so on.

13. From equation (11) we may readily obtain a corresponding
equation for correlations. For (11) may be written

\[ b_{12.34\ldots n} = \frac{r_{12.34\ldots(n-1)} - r_{1n.34\ldots(n-1)}\,r_{2n.34\ldots(n-1)}}
                           {1 - r_{2n.34\ldots(n-1)}^{2}}
                      \cdot \frac{\sigma_{1.34\ldots(n-1)}}{\sigma_{2.34\ldots(n-1)}}. \]

Hence, writing down the corresponding expression for $b_{21.34\ldots n}$
and taking the square root

\[ r_{12.34\ldots n} = \frac{r_{12.34\ldots(n-1)} - r_{1n.34\ldots(n-1)}\,r_{2n.34\ldots(n-1)}}
                           {(1 - r_{1n.34\ldots(n-1)}^{2})^{1/2}\,(1 - r_{2n.34\ldots(n-1)}^{2})^{1/2}}. \tag{12} \]

This is, similarly, the expression for three variables

\[ r_{12.n} = \frac{r_{12} - r_{1n}\,r_{2n}}{(1 - r_{1n}^{2})^{1/2}\,(1 - r_{2n}^{2})^{1/2}} \]

with the secondary subscripts added throughout, and $r_{12.34\ldots n}$
can be assigned interpretations corresponding to those of $b_{12.34\ldots n}$
above. Evidently equation (12) permits of an absolute check on
the arithmetic in the calculation of all partial coefficients of an
order higher than the first, for any one of the secondary suffixes
of $r_{12.34\ldots n}$ can be eliminated so as to obtain another equation of
the same form as (12), and the value obtained for $r_{12.34\ldots n}$ by
inserting the values of the coefficients of lower order in the
expression on the right must be the same in each case.
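
Equation (12) lends itself to a recursive routine: a coefficient of any order is reduced, one secondary suffix at a time, to coefficients of order zero. The sketch below is illustrative only. The function name `partial_corr`, the dictionary layout, and the table of five variables all correlated at +0.5 are assumptions chosen because the answers for such a table are easy to verify by hand.

```python
import math
from itertools import combinations

def partial_corr(r, i, j, given=()):
    """Partial correlation r_{ij.given}, by repeated use of the recursion (12).

    r maps frozenset({a, b}) to the zero-order correlation between a and b.
    """
    if not given:
        return r[frozenset((i, j))]
    rest, k = tuple(given[:-1]), given[-1]      # eliminate subscript k
    r_ij = partial_corr(r, i, j, rest)
    r_ik = partial_corr(r, i, k, rest)
    r_jk = partial_corr(r, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

# Hypothetical table: five variables, every pair correlated +0.5.
r = {frozenset(pair): 0.5 for pair in combinations(range(1, 6), 2)}

first = partial_corr(r, 1, 2, (3,))
second_a = partial_corr(r, 1, 2, (3, 4))
second_b = partial_corr(r, 1, 2, (4, 3))    # same coefficient, other order
third = partial_corr(r, 1, 2, (3, 4, 5))
print(first, second_a, second_b, third)
```

Calling the routine with the secondary suffixes in a different order, as for `second_a` and `second_b`, reproduces the check on arithmetic described above.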

14. The equations now obtained provide all that is necessary
for the arithmetical solution of problems in multiple correlation.
The best mode of procedure on the whole, having calculated all
the correlations and standard-deviations of order zero, is (1) to
calculate the correlations of higher order by successive applications
of equation (12); (2) to calculate any required standard-deviations
by equation (10); (3) to calculate any required regressions by
equation (8): the use of equation (11) for calculating the
regressions of successive orders directly from each other is
comparatively clumsy. We will give two illustrations, the first for






three and the second for four variables. The introduction of
more variables does not involve any difference in the form of the
arithmetic, but rapidly increases the amount.

Example i. The first illustration we shall take will be a
continuation of example i. of Chapter IX., in which the correlation
was worked out between (1) the average earnings of agricultural
labourers and (2) the percentage of the population in
receipt of Poor-law relief in a group of 38 rural districts. In
Question 2 of the same chapter are given (3) the ratios of the
numbers in receipt of outdoor relief to the numbers relieved in the
workhouse, in the same districts. Required to work out the partial
correlations, regressions, etc., for these three variables.

Using as our notation $X_1$ = average earnings, $X_2$ = percentage of
population in receipt of relief, $X_3$ = out-relief ratio, the first constants
are


    $M_1 = 15.9$ shillings     $\sigma_1 = 1.71$ shillings     $r_{12} = -0.66$
    $M_2 = 3.67$ per cent.     $\sigma_2 = 1.29$ per cent.     $r_{13} = -0.13$
    $M_3 = 5.79$               $\sigma_3 =$ [illegible]         $r_{23} = +0.60$


To obtain the partial correlations, equation (12) is used direct in
its simplest form:

\[ r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{(1 - r_{13}^{2})^{1/2}\,(1 - r_{23}^{2})^{1/2}}. \]

The work is best done systematically and the results collected
in tabular form, especially if logarithms are used, as many of the
logarithms occur repeatedly. First it will be noted that the
logarithms of $(1 - r^2)^{1/2}$ occur in all the denominators; these had,
accordingly, better be worked out at once and tabulated (col. 2 of
the table below). In col. 3 the product term of the numerator of


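 1.               2.               3.         4.          5.           6.            7.           8.           9.
 Correlation      log              Product    Numerator   log          log           Correlation of            log
 (order zero)     $\sqrt{1-r^2}$   term                   numerator    denominator   first order               $\sqrt{1-r^2}$
                                                                                     log          value        (first order)

 $r_{12} = -0.66$   $\bar{1}.87580$  -0.0780    -0.5820     [illeg.]     $\bar{1}.89938$  [illeg.]     [illeg.]     [illeg.]
 $r_{13} = -0.13$   [illeg.]         -0.3960    +0.2660     [illeg.]     $\bar{1}.77889$  [illeg.]     [illeg.]     [illeg.]
 $r_{23} = +0.60$   [illeg.]         +0.0858    +0.5142     $\bar{1}.71113$  [illeg.]     [illeg.]     [illeg.]     [illeg.]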



each partial coefficient is entered, i.e. the product of the two other
coefficients on the remaining lines in col. 1; subtracting this from
the coefficient on the same line in col. 1 we have the numerator (col.
4) and can enter its logarithm. The logarithm of the denominator
(col. 6) is obtained at once by adding the two logarithms of $(1 - r^2)^{1/2}$
on the remaining lines of the table, and subtracting the logarithms



of the denominators from those of the numerators we have the
logarithms of the correlations of the first order. It is also as well
to calculate at once, for reference in the calculation of
standard-deviations of the second order, the values of log $\sqrt{1 - r^2}$ for the
first-order coefficients (col. 9).
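
With a machine the logarithmic table is unnecessary, and the three partial correlations of Example i. follow directly from equation (12). The sketch below uses the correlations of order zero given for the example; the function name `partial_r` is an assumption, and the logarithms are printed as ordinary negative numbers rather than in the bar notation of the table.

```python
import math

def partial_r(r_ab, r_ac, r_bc):
    """r_ab.c from the three coefficients of order zero (equation (12))."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Correlations of order zero as given for Example i.
r12, r13, r23 = -0.66, -0.13, +0.60

r12_3 = partial_r(r12, r13, r23)   # earnings and pauperism, out-relief given
r13_2 = partial_r(r13, r12, r23)   # earnings and out-relief, pauperism given
r23_1 = partial_r(r23, r12, r13)   # pauperism and out-relief, earnings given

for name, value in (("r12.3", r12_3), ("r13.2", r13_2), ("r23.1", r23_1)):
    # Last figure: log of sqrt(1 - r^2), the quantity tabulated in col. 9.
    print(name, round(value, 4), round(math.log10(math.sqrt(1 - value ** 2)), 5))
```

To two places the results agree with the values quoted later in the discussion of the example, namely -0.73, +0.44 and +0.69.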

Having obtained the correlations we can now proceed to the
regressions. If we wish to find all the regression-equations, we
shall have six regressions to calculate from equations of the form

\[ b_{12.3} = r_{12.3}\,\sigma_{1.3}/\sigma_{2.3}. \]

These will involve all the six standard-deviations of the first
order $\sigma_{1.2}$, $\sigma_{1.3}$, $\sigma_{2.1}$, $\sigma_{2.3}$, etc. But the standard-deviations of
the first order are not in themselves of much interest, and the
standard-deviations of the second order are so, as being the
standard-errors or root-mean-square errors of estimate made in
using the regression-equations of the second order. We may
save needless arithmetic, therefore, by replacing the
standard-deviations of the first order by those of the second, omitting the
former entirely, and transforming the above equation for $b_{12.3}$
to the form

\[ b_{12.3} = r_{12.3}\,\sigma_{1.23}/\sigma_{2.13}. \]

This transformation is a useful one and should be noted by the
student. The values of each $\sigma$ may be calculated twice
independently by the formulae of the form

\[ \sigma_{1.23} = \sigma_1\,(1 - r_{12}^{2})^{1/2}(1 - r_{13.2}^{2})^{1/2}
                 = \sigma_1\,(1 - r_{13}^{2})^{1/2}(1 - r_{12.3}^{2})^{1/2} \]

so as to check the arithmetic; the work is rapidly done if the
values of log $\sqrt{1 - r^2}$ have been tabulated. The values found are

    log $\sigma_{1.23} = 0.06146$          $\sigma_{1.23} = 1.15$
    log $\sigma_{2.13} = \bar{1}.84584$    $\sigma_{2.13} = 0.70$
    log $\sigma_{3.12} = 0.34571$          $\sigma_{3.12} = 2.22$

From these and the logarithms of the r's we have

    log $b_{12.3} = 0.08116$,        $b_{12.3} = -1.21$ :  log $b_{13.2} = \bar{1}.36174$,  $b_{13.2} = +0.23$
    log $b_{21.3} = \bar{1}.64993$,  $b_{21.3} = -0.45$ :  log $b_{23.1} = \bar{1}.33917$,  $b_{23.1} = +0.22$
    log $b_{31.2} = \bar{1}.93024$,  $b_{31.2} = +0.85$ :  log $b_{32.1} = 0.33891$,        $b_{32.1} = +2.18$

That is, the regression-equations are

    (1) $x_1 = -1.21\,x_2 + 0.23\,x_3$
    (2) $x_2 = -0.45\,x_1 + 0.22\,x_3$
    (3) $x_3 = +0.85\,x_1 + 2.18\,x_2$




or, transferring the origins to zero,

    (1) Earnings,         $X_1 = +19.0 - 1.21\,X_2 + 0.23\,X_3$
    (2) Pauperism,        $X_2 = +9.55 - 0.45\,X_1 + 0.22\,X_3$
    (3) Out-relief ratio, $X_3 = -15.7 + 0.85\,X_1 + 2.18\,X_2$

The units are throughout one shilling for the earnings $X_1$, 1
per cent. for the pauperism $X_2$, and 1 for the out-relief ratio $X_3$.
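
The arithmetic for the first two regressions can be retraced without logarithms. The following is a sketch under stated assumptions: it uses the standard-deviations of earnings and pauperism and the three correlations given for Example i., and the helper names `partial_r` and `root` are hypothetical. Each standard-deviation of the second order is computed by both orders of elimination, as the text recommends.

```python
import math

def partial_r(r_ab, r_ac, r_bc):
    """r_ab.c from the three coefficients of order zero (equation (12))."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

def root(r):
    """The factor sqrt(1 - r^2) that recurs throughout the work."""
    return math.sqrt(1 - r ** 2)

# Constants of order zero from Example i.
r12, r13, r23 = -0.66, -0.13, 0.60
sigma1, sigma2 = 1.71, 1.29          # shillings, per cent.

r12_3 = partial_r(r12, r13, r23)
r13_2 = partial_r(r13, r12, r23)
r23_1 = partial_r(r23, r12, r13)

# Each second-order standard-deviation, calculated twice independently.
sigma1_23_a = sigma1 * root(r12) * root(r13_2)
sigma1_23_b = sigma1 * root(r13) * root(r12_3)
sigma2_13_a = sigma2 * root(r12) * root(r23_1)
sigma2_13_b = sigma2 * root(r23) * root(r12_3)

# Regressions from the transformed equation b12.3 = r12.3 sigma1.23 / sigma2.13.
b12_3 = r12_3 * sigma1_23_a / sigma2_13_a
b21_3 = r12_3 * sigma2_13_a / sigma1_23_a

print(round(sigma1_23_a, 2), round(sigma2_13_a, 2))
print(round(b12_3, 3), round(b21_3, 3))
```

The remaining regressions follow in the same way once the standard-deviation of the out-relief ratio is supplied.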

The first and second regression-equations are those of most
practical importance. The argument has been advanced that
the giving of out-relief tends to lower earnings, and the total
coefficient ($r_{13} = -0.13$) between earnings ($X_1$) and out-relief
($X_3$), though very small (cf. Chap. IX. § 17), does not seem
inconsistent with such a hypothesis. The partial correlation
coefficient ($r_{13.2} = +0.44$) and the regression-equation (1),
however, indicate that in unions with a given percentage of the
population in receipt of relief ($X_2$) the earnings are highest where
the proportion of out-relief is highest; and this is, in so far,
against the hypothesis of a tendency to lower wages. It remains
possible, of course, that out-relief may adversely affect the
possibility of earning, e.g. by limiting the employment of the old. As
regards pauperism, the argument might be advanced that the
observed correlation ($r_{23} = +0.60$) between pauperism and
out-relief was in part due to the negative correlation ($r_{13} = -0.13$)
between earnings and out-relief. Such a hypothesis would have
little to support it in view of the smallness and doubtful
significance of $r_{13}$, and is definitely contradicted by the positive partial
correlation $r_{23.1} = +0.69$, and the second regression-equation. The
third regression-equation shows that the proportion of out-relief is
on the whole highest where earnings are highest and pauperism
greatest. It should be noticed, however, that a negative ratio is
clearly impossible, and consequently the relation cannot be strictly
linear; but the third equation gives possible (positive) average
ratios for all the combinations of pauperism and earnings that
actually occur.

Example ii. (Four variables.) As an illustration of the form
of the work in the case of four variables, we will take a portion
of the data from another investigation into the causation of
pauperism, viz. that described in the first illustration of Chapter X.,
to which the student should refer for details. The variables are
the ratios of the values in 1891 to the values in 1881 (taken as
100) of:

1. The percentage of the population in receipt of relief,

2. The ratio of the numbers given outdoor relief to the numbers
relieved in the workhouse,

3. The percentage of the population over 65 years of age,




4. The population itself,
in the metropolitan group of 32 unions, and the fundamental
constants (means, standard-deviations and correlations) are as
follows:

TABLE I.

    1.                 2.                  3.                     4.
    Means.             Standard-           Correlation-           log $\sqrt{1 - r^2}$.
                       deviations.         coefficient.

    1    104.7         1    29.2           12    +0.52            $\bar{1}.93154$
    2     90.6         2    41.7           13    +0.41            $\bar{1}.96003$
    3    107.7         3     5.5           14    -0.14            [illegible]
    4    111.3         4    23.8           23    +0.49            $\bar{1}.94038$
                                           24    +0.23            $\bar{1}.98820$
                                           34    +0.25            $\bar{1}.98598$



It is seen that the average changes are not great; the
percentages of the population in receipt of relief have increased on
an average by 4.7 per cent., the out-relief ratio has dropped by
9.4 per cent., and the percentage of old has increased by 7.7
per cent., at the same time as the population of the unions has
risen on the average by 11.3 per cent. At the same time the
standard-deviations of the first, second, and fourth variables are
very large. As a matter of fact, while in one union the
pauperism decreased by nearly 50 per cent. and in others by
20 per cent., in some there were increases of 60, 80, and 90
per cent.; similarly, in the case of the out-relief, in several unions
the ratio was decreased by 40 to 60 per cent., a consistent
anti-out-relief policy having been enforced; in others the ratio
was doubled, and more than doubled. As regards population,
the more central districts show decreases ranging up to 20 and
25 per cent., the circumferential districts increases of 46 to 80
per cent. The correlations of order zero are not large, the
changes in the rate of pauperism exhibiting the highest correlation
with changes in the out-relief ratio, slightly less with changes
in the proportion of old, and very little with changes in
population.

The correlations of the second order are obtained in two steps.
In the first place, the six coefficients of order zero are grouped in
four sets of three, corresponding to the four sets of three variables
formed by omitting each one of the four variables in turn (Table
II. col. 1). Each of these sets of three coefficients is then
treated in the same manner as in the last example, and so the


TABLE II.

    1.                   2.            3.            4.                    5.
    Correlation-         Product       Numerator.    Correlation-          log $\sqrt{1 - r^2}$.
    coefficient          Term of                     coefficient
    (Zero Order).        Numerator.                  (First Order).

    12    +0.52          +0.2009       +0.3191       12.3    +0.4013       $\bar{1}.96187$
    13    +0.41          +0.2548       +0.1552       13.2    +0.2084       $\bar{1}.99036$
    23    +0.49          +0.2132       +0.2768       23.1    +0.3553       $\bar{1}.97070$

    12    +0.52          -0.0322       +0.5522       12.4    +0.5731       $\bar{1}.91355$
    14    -0.14          +0.1196       -0.2596       14.2    -0.3123       $\bar{1}.97772$
    24    +0.23          -0.0728       +0.3028       24.1    +0.3580       $\bar{1}.97022$

    13    +0.41          -0.0350       +0.4450       13.4    +0.4642       $\bar{1}.94731$
    14    -0.14          +0.1025       -0.2425       14.3    -0.2746       $\bar{1}.98297$
    34    +0.25          -0.0574       +0.3074       34.1    +0.3404       $\bar{1}.97326$

    23    +0.49          +0.0575       +0.4325       23.4    +0.4590       $\bar{1}.94863$
    24    +0.23          +0.1225       +0.1075       24.3    +0.1274       $\bar{1}.99645$
    34    +0.25          +0.1127       +0.1373       34.2    +0.1618       $\bar{1}.99424$

TABLE III.

    1.                     2.            3.            4.                     5.
    Correlation-           Product       Numerator.    Correlation-           log $\sqrt{1 - r^2}$.
    coefficient            Term of                     coefficient
    (First Order).         Numerator.                  (Second Order).

    12.4    +0.5731        +0.2131       +0.3600       12.34    +0.457        $\bar{1}.94901$
    13.4    +0.4642        +0.2631       +0.2011       13.24    +0.276        $\bar{1}.98277$
    23.4    +0.4590        +0.2660       +0.1930       23.14    +0.266        $\bar{1}.98405$

    12.3    +0.4013        -0.0350       +0.4363       12.34    +0.457        -
    14.3    -0.2746        +0.0511       -0.3257       14.23    -0.359        $\bar{1}.97013$
    24.3    +0.1274        -0.1102       +0.2376       24.13    +0.270        $\bar{1}.98359$

    13.2    +0.2084        -0.0505       +0.2589       13.24    +0.276        -
    14.2    -0.3123        +0.0337       -0.3460       14.23    -0.359        -
    34.2    +0.1618        -0.0651       +0.2269       34.12    +0.244        $\bar{1}.98664$

    23.1    +0.3553        +0.1219       +0.2334       23.14    +0.266        -
    24.1    +0.3580        +0.1209       +0.2371       24.13    +0.270        -
    34.1    +0.3404        +0.1272       +0.2132       34.12    +0.244        -


















correlations of the first order (Table II. col. 4) are obtained.
The first-order coefficients are then regrouped in sets of three,
with the same secondary suffix (Table III. col. 1), and these
are treated precisely in the same way as the coefficients of order
zero. In this way, it will be seen, the value of each coefficient
of the second order is arrived at in two ways independently, and
so the arithmetic is checked: $r_{12.34}$ occurs in the first and fourth
lines, for instance, $r_{13.24}$ in the second and seventh, and so on.
Of course slight differences may occur in the last digit if a
sufficient number of digits is not retained, and for this reason the
intermediate work should be carried to a greater degree of
accuracy than is necessary in the final result; thus four places
of decimals were retained throughout in the intermediate work of
this example, and three in the final result. If he carries out an
independent calculation, the student may differ slightly from
the logarithms given in this and the following work, if more or
fewer figures are retained.
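
The two-step tabular procedure can be mirrored in a short program. The sketch below is illustrative only: it starts from the six correlations of order zero of Table I., builds every first-order coefficient, and then reaches each second-order coefficient by both available routes. The names `step`, `first`, `f` and `second` are assumptions, and full machine precision is kept throughout instead of the four decimal places of the text.

```python
import math
from itertools import combinations

# Correlations of order zero, Example ii.
zero = {(1, 2): 0.52, (1, 3): 0.41, (1, 4): -0.14,
        (2, 3): 0.49, (2, 4): 0.23, (3, 4): 0.25}

def step(r_ab, r_ac, r_bc):
    """One application of equation (12): r_ab.c from r_ab, r_ac, r_bc."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# First order: key (a, b, c) holds r_ab.c, one set of three per omitted variable.
first = {}
for a, b, c in combinations((1, 2, 3, 4), 3):
    r_ab, r_ac, r_bc = zero[(a, b)], zero[(a, c)], zero[(b, c)]
    first[(a, b, c)] = step(r_ab, r_ac, r_bc)
    first[(a, c, b)] = step(r_ac, r_ab, r_bc)
    first[(b, c, a)] = step(r_bc, r_ab, r_ac)

def f(x, y, s):
    """Look up the first-order coefficient r_xy.s regardless of the order of x, y."""
    return first[(min(x, y), max(x, y), s)]

# Second order: each coefficient r_ab.cd by two independent routes.
second = {}
for a, b in combinations((1, 2, 3, 4), 2):
    c, d = [v for v in (1, 2, 3, 4) if v not in (a, b)]
    via_c = step(f(a, b, c), f(a, d, c), f(b, d, c))   # set with suffix c
    via_d = step(f(a, b, d), f(a, c, d), f(b, c, d))   # set with suffix d
    second[(a, b)] = (via_c, via_d)

for pair, (via_c, via_d) in second.items():
    print(pair, round(via_c, 3), round(via_d, 3))
```

With unrounded intermediate values the two routes agree to rounding error; small discrepancies in a hand calculation arise only from the digits dropped along the way.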

Having obtained the correlations, the regressions can be
calculated from the third-order standard-deviations by equations of the
form (as in the last example),

\[ b_{12.34} = r_{12.34}\,\sigma_{1.234}/\sigma_{2.134}, \]

so the standard-deviations of lower orders need not be evaluated.
Using equations of the form

\[ \sigma_{1.234} = \sigma_1\,(1 - r_{12}^{2})^{1/2}(1 - r_{13.2}^{2})^{1/2}(1 - r_{14.23}^{2})^{1/2}
                  = \sigma_1\,(1 - r_{14}^{2})^{1/2}(1 - r_{13.4}^{2})^{1/2}(1 - r_{12.34}^{2})^{1/2} \]

we find

    log $\sigma_{1.234} = 1.35740$      $\sigma_{1.234} = 22.8$
    log $\sigma_{2.134} = 1.50597$      $\sigma_{2.134} = 32.1$
    log $\sigma_{3.124} = 0.65773$      $\sigma_{3.124} = 4.55$
    log $\sigma_{4.123} = 1.32914$      $\sigma_{4.123} = 21.3$

All the twelve regressions of the second order can be readily
calculated, given these standard-deviations and the correlations,
but we may confine ourselves to the equation giving the changes
in pauperism ($X_1$) in terms of other variables as the most
important. It will be found to be

\[ x_1 = 0.325\,x_2 + 1.383\,x_3 - 0.383\,x_4, \]

or, transferring the origins and expressing the equation in terms of
percentage-ratios,

\[ X_1 = -31.1 + 0.325\,X_2 + 1.383\,X_3 - 0.383\,X_4, \]




or, again, in terms of percentage-changes (ratio - 100),

    Percentage change in pauperism
        = +1.4 per cent.
          + 0.325 times the change in out-relief ratio
          + 1.383 times the change in proportion of old
          - 0.383 times the change in population.
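
The regression-equation for pauperism can be recomputed from the constants of Table I. The sketch below is illustrative and rests on assumptions: the helper names `partial_corr` and `sigma` are hypothetical, the standard-deviations are read from Table I. as 29.2, 41.7, 5.5 and 23.8, and the work is carried at full machine precision, so the last digit may differ slightly from the hand calculation.

```python
import math

# Constants of order zero, Example ii. (Table I.).
zero = {frozenset(k): v for k, v in {
    (1, 2): 0.52, (1, 3): 0.41, (1, 4): -0.14,
    (2, 3): 0.49, (2, 4): 0.23, (3, 4): 0.25}.items()}
mean = {1: 104.7, 2: 90.6, 3: 107.7, 4: 111.3}
sd = {1: 29.2, 2: 41.7, 3: 5.5, 4: 23.8}

def partial_corr(i, j, given=()):
    """r_{ij.given} by repeated use of equation (12)."""
    if not given:
        return zero[frozenset((i, j))]
    rest, k = tuple(given[:-1]), given[-1]
    r_ij, r_ik, r_jk = (partial_corr(a, b, rest)
                        for a, b in ((i, j), (i, k), (j, k)))
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def sigma(i, others):
    """Standard-deviation of x_i about its regression on `others`, equation (10)."""
    s = sd[i]
    for n, k in enumerate(others):
        s *= math.sqrt(1 - partial_corr(i, k, tuple(others[:n])) ** 2)
    return s

# Partial regressions of variable 1 on each of 2, 3, 4, the other two being given.
b = {}
for j in (2, 3, 4):
    rest = tuple(k for k in (2, 3, 4) if k != j)
    b[j] = partial_corr(1, j, rest) * sigma(1, (2, 3, 4)) / sigma(j, (1,) + rest)

# Constant term for percentage-ratios, and for percentage-changes (ratio - 100).
intercept = mean[1] - sum(b[j] * mean[j] for j in b)
constant = intercept + 100 * sum(b.values()) - 100

print({j: round(v, 3) for j, v in b.items()}, round(intercept, 1), round(constant, 1))
```

The three coefficients reproduce the figures of the text to three places; the constant term is sensitive to the rounding of the coefficients and may differ in the first decimal.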

These results render the interpretation of the total coefficients,
which might be equally consistent with several hypotheses, more
clear and definite. The questions would arise, for instance,
whether the correlation of changes in pauperism with changes in
out-relief might not be due to correlation of the latter with the
other factors introduced, and whether the negative correlation with
changes in population might not be due solely to the correlation
of the latter with changes in the proportion of old. As a matter
of fact, the partial correlations of changes in pauperism with
changes in out-relief and in proportion of old are slightly less than
the total correlations, but the partial correlation with changes in
population is numerically greater, the figures being

    $r_{12} = +0.52$      $r_{12.34} = +0.46$
    $r_{13} = +0.41$      $r_{13.24} = +0.28$
    $r_{14} = -0.14$      $r_{14.23} = -0.36$

So far, then, as we have taken the factors of the case into
account, there appears to be a true correlation between changes
in pauperism and changes in out-relief, proportion of old, and
population (the latter serving, of course, as some index to
changes in general prosperity). The relative influences of the
three factors are indicated by the regression-equation above.
[For the full discussion of the case cf. Jour. Roy. Stat. Soc.,
vol. lxii., 1899.]

15. The correlation between pauperism and labourers' earnings
exhibited by the figures of Example i. was illustrated by a diagram
(fig. 40, p. 180), in which scales of "pauperism" and "earnings"
were taken along two axes at right angles, and every observed
pair of values was entered by marking the corresponding point
with a small circle: the diagram was completed by drawing in
the lines of regression. In precisely the same way the correlation
between three variables may be represented by a model showing the
distribution of points in space; for any set of observed values $X_1$,
$X_2$, $X_3$ may be regarded as determining a point in space, just as
any pair of values $X_1$ and $X_2$ may be regarded as determining a
point in a plane. Fig. 45 is drawn from such a model, constructed
from the data of Example i. Four pieces of wood are fixed together


like the bottom and three sides of a box. Supposing the open
side to face the observer, a scale of pauperism is drawn vertically
upwards along the left-hand angle at the back of the "box," the



Fig. 45. Model illustrating the Correlation between three Variables: (1)
Pauperism (percentage of the population in receipt of Poor-law relief);
(2) Out-relief ratio (numbers given relief in their homes to one in the
workhouse); (3) Average Weekly Earnings of agricultural labourers
(data pp. 178 and 189). A, front view; B, view of model tilted till the
plane of regression for pauperism on the two remaining variables is seen
as a straight line.




scale starting from zero, as very small values of pauperism occur:
a scale of out-relief ratio is taken along the angle between the
back and bottom of the box, starting from zero at the left: finally,
the scale of earnings is drawn out towards the observer along the
angle between the left-hand side and the bottom, but as earnings
lower than 12s. do not occur, the scale may start from 12s. at the
corner. Suitable scales are: pauperism, 1 in. = 1 per cent.;
out-relief ratio, 1 in. = 1 unit; earnings, 1 in. = 1s.; and the inside
measures of the model may then be 17 in. x 10 in. x 8 in. high,
the dimensions of the model constructed. Given these three
scales, any set of observed values determine a point within the
"box." The earnings and out-relief ratio for some one union are
noted first, and the corresponding point marked on the baseboard;
a steel wire is then inserted vertically in the base at this point
and cut off at the height corresponding, on the scale chosen, to
the pauperism in the same union, being finally capped with a
small ball or knob to mark the "point" clearly. The model
shows very well the general tendency of the pauperism to be the
higher the lower the wages and the higher the out-relief, for the
highest points lie towards the back and right-hand side of the
model. If some representation of all three equations of regression
were to be inserted in the model, the result would be rather
confusing; so the most important equation, viz. the second, giving
the average rate of pauperism in terms of the other variables, may
be chosen. This equation represents a plane: the lines in which
it cuts the right- and left-hand sides of the "box" should be
marked, holes drilled at equal intervals on these lines on the
opposite sides of the box (the holes facing each other), and threads
stretched through these holes, thus outlining the plane as shown
in the figure. In the actual model the correlation-diagrams (like
fig. 40) corresponding to the three pairs of variables were drawn
on the back, sides and base: they represent, of course, the
elevations and plan of the points.

The student possessing some skill in handicraft would find it
worth while to make such a model for some case of interest to
himself, and to study on it thoroughly the nature of the plane of
regression, and the relations of the partial and total correlations.

16. If we write

\[ \sigma_{1.23\ldots n}^{2} = \sigma_1^{2}\,(1 - R_{1(23\ldots n)}^{2}), \tag{13} \]

it may be shown that $R_{1(23\ldots n)}$ is the correlation between
$x_1$ and the expression on the right-hand side of the regression-equation,
say $e_{1.23\ldots n}$, where

\[ e_{1.23\ldots n} = b_{12.34\ldots n}\,x_2 + b_{13.24\ldots n}\,x_3 + \ldots + b_{1n.23\ldots(n-1)}\,x_n. \tag{14} \]




For we have

\[ \Sigma(x_1\,e_{1.23\ldots n}) = \Sigma\,x_1(x_1 - x_{1.23\ldots n}) = N(\sigma_1^{2} - \sigma_{1.23\ldots n}^{2}), \]

and also

\[ \Sigma(e_{1.23\ldots n}^{2}) = \Sigma(x_1 - x_{1.23\ldots n})^{2} = N(\sigma_1^{2} - \sigma_{1.23\ldots n}^{2}), \]

whence the correlation between $x_1$ and $e_{1.23\ldots n}$ is

\[ \frac{(\sigma_1^{2} - \sigma_{1.23\ldots n}^{2})^{1/2}}{\sigma_1}, \]

i.e. the value of $R_{1(23\ldots n)}$ given by (13). The value of R is
accordingly a useful datum as indicating how closely $x_1$ can
be expressed in terms of a linear function of $x_2$, $x_3$ .... $x_n$, and
the values of the regressions may be regarded as determined
by the condition that R shall be a maximum. Its value is
essentially positive as the product-sum $\Sigma(x_1\,e_{1.23\ldots n})$ is positive.
R may be termed a coefficient of (n - 1)-fold (or double, triple,
etc.) correlation; for n variables there are n such correlations,
but in the limiting case of two variables the two are identical.
The value may be readily calculated, either from $\sigma_{1.23\ldots n}$ and
$\sigma_1$, or directly from the equation

\[ 1 - R_{1(23\ldots n)}^{2} = (1 - r_{12}^{2})(1 - r_{13.2}^{2})(1 - r_{14.23}^{2}) \ldots (1 - r_{1n.23\ldots(n-1)}^{2}). \tag{15} \]

It is obvious from this equation that since every bracket on
the right is not greater than unity,

\[ 1 - R_{1(23\ldots n)}^{2} \le 1 - r_{12}^{2}. \]

Hence $R_{1(23\ldots n)}$ cannot be numerically less than $r_{12}$. For the
same reason, rewriting (15) in every possible form, $R_{1(23\ldots n)}$
cannot be numerically less than $r_{12}$, $r_{13}$, .... $r_{1n}$, i.e. any one
of the possible constituent coefficients of order zero. Further,
for similar reasons, $R_{1(23\ldots n)}$ cannot be numerically less than
any possible constituent coefficient of any higher order. That
is to say, $R_{1(23\ldots n)}$ is not numerically less than the greatest
of all the possible constituent coefficients, and is usually, though
not always, markedly greater. Thus in Example i., $R_{2(13)}$
(the coefficient of double correlation between pauperism on
the one hand, out-relief and labourers' earnings on the other)
is 0.839, and the numerically greatest of the possible constituent
coefficients is $r_{12.3} = -0.73$. Again, in Example ii., $R_{1(234)}$ is
0.626, and the numerically greatest of the possible constituent
coefficients is $r_{12.4} = +0.573$.
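
The value quoted for Example i. can be recomputed from equation (15). The sketch below is an illustration under stated assumptions: `partial_r` is a hypothetical helper implementing the three-variable form of equation (12), and the coefficient of double correlation is obtained by both orders of elimination.

```python
import math

def partial_r(r_ab, r_ac, r_bc):
    """r_ab.c from the three coefficients of order zero (equation (12))."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Example i.: pauperism (2) on earnings (1) and out-relief (3).
r12, r13, r23 = -0.66, -0.13, 0.60

# 1 - R^2 = (1 - r21^2)(1 - r23.1^2), earnings taken into account first.
r23_1 = partial_r(r23, r12, r13)
R2_13 = math.sqrt(1 - (1 - r12 ** 2) * (1 - r23_1 ** 2))

# The same coefficient with out-relief taken into account first.
r12_3 = partial_r(r12, r13, r23)
R2_13_check = math.sqrt(1 - (1 - r23 ** 2) * (1 - r12_3 ** 2))

# R is not numerically less than any of its constituent coefficients.
constituents = [r12, r23, r12_3, r23_1]
print(round(R2_13, 3), round(max(abs(r) for r in constituents), 3))
```

The positive square root is taken, in keeping with the remark that R is essentially positive.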

The student should notice that R is necessarily positive.
Further, even if all the variables $X_1$, $X_2$, .... $X_n$ were strictly
uncorrelated in the original universe as a whole, we should expect
$r_{12}$, $r_{13.2}$, $r_{14.23}$, etc., to exhibit values (whether positive or negative)



differing from zero in a limited sample. Hence, R will not
tend, on an average of such samples, to be zero, but will
fluctuate round some mean value. This mean value will
be the greater the smaller the number of observations in the
sample, and also the greater the number of variables. When
only a small number of observations are available it is,
accordingly, little use to deal with a large number of variables.
As a limiting case, it is evident that if we deal with n variables
and possess only n observations, all the partial correlations
of the highest possible order will be unity.

17. It is obvious that as equations (11) and (12) enable us to
express regressions and correlations of higher orders in terms of
those of lower orders, we must similarly be able to express the
coefficients of lower in terms of those of higher orders. Such
expressions are sometimes useful for theoretical work. Using the
same method of expansion as in previous cases, we have

\[ 0 = \Sigma(x_{1.23\ldots n}\,x_{2.34\ldots(n-1)}) \]
\[ \phantom{0} = \Sigma(x_1\,x_{2.34\ldots(n-1)}) - b_{12.34\ldots n}\,\Sigma(x_2\,x_{2.34\ldots(n-1)})
   - b_{1n.23\ldots(n-1)}\,\Sigma(x_n\,x_{2.34\ldots(n-1)}). \]

That is,

\[ b_{12.34\ldots(n-1)} = b_{12.34\ldots n} + b_{1n.23\ldots(n-1)}\,b_{n2.34\ldots(n-1)}. \]

In this equation the coefficient on the left and the last on the
right are of order n - 3, the other two of order n - 2. We therefore
wish to eliminate the last coefficient on the right. Interchanging
the suffixes 1 for n and n for 1, we have

\[ b_{n2.34\ldots(n-1)} = b_{n2.13\ldots(n-1)} + b_{n1.23\ldots(n-1)}\,b_{12.34\ldots(n-1)}. \]

Substituting this value for $b_{n2.34\ldots(n-1)}$ in the first equation we
have

\[ b_{12.34\ldots(n-1)} = \frac{b_{12.34\ldots n} + b_{1n.23\ldots(n-1)}\,b_{n2.13\ldots(n-1)}}
                              {1 - b_{1n.23\ldots(n-1)}\,b_{n1.23\ldots(n-1)}}. \tag{16} \]

This is the required equation for the regressions; it is the equation

\[ b_{12} = \frac{b_{12.n} + b_{1n.2}\,b_{n2.1}}{1 - b_{1n.2}\,b_{n1.2}} \]

with secondary suffixes 34 .... (n - 1) added throughout. The
corresponding equation for the correlations is obtained at once
by writing down equation (16) for $b_{21.34\ldots(n-1)}$ and taking the
square root of the product (cf. § 13); this gives

\[ r_{12.34\ldots(n-1)} = \frac{r_{12.34\ldots n} + r_{1n.23\ldots(n-1)}\,r_{2n.13\ldots(n-1)}}
                              {(1 - r_{1n.23\ldots(n-1)}^{2})^{1/2}\,(1 - r_{2n.13\ldots(n-1)}^{2})^{1/2}}, \tag{17} \]






which is similarly the equation

\[ r_{12} = \frac{r_{12.n} + r_{1n.2}\,r_{2n.1}}{(1 - r_{1n.2}^{2})^{1/2}\,(1 - r_{2n.1}^{2})^{1/2}} \]

with the secondary suffixes 34 .... (n - 1) added throughout.
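
The passage from higher to lower order can be tested on the three variables of Example i.: the partial correlations are first formed by equation (12), and equation (17) in its simplest form should then return the original coefficient of order zero. The sketch is illustrative, and `partial_r` is an assumed helper name.

```python
import math

def partial_r(r_ab, r_ac, r_bc):
    """r_ab.c from the three coefficients of order zero (equation (12))."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Correlations of order zero, Example i.
r12, r13, r23 = -0.66, -0.13, 0.60

# First-order coefficients by equation (12).
r12_3 = partial_r(r12, r13, r23)
r13_2 = partial_r(r13, r12, r23)
r23_1 = partial_r(r23, r12, r13)

# Equation (17) in its simplest form: r12 from r12.3, r13.2 and r23.1.
r12_back = (r12_3 + r13_2 * r23_1) / math.sqrt((1 - r13_2 ** 2) * (1 - r23_1 ** 2))

print(r12, r12_back)
```

Note the change of sign in the numerator as compared with equation (12): the product term is added, not subtracted.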

18. Equations (12) and (17) imply that certain limiting
inequalities must hold between the correlation-coefficients in
the expression on the right in each case in order that real
values (values between ±1) may be obtained for the
correlation-coefficient on the left. These inequalities correspond precisely
with those "conditions of consistence" between class-frequencies
with which we dealt in Chapter II., but we propose to treat them
only briefly here. Writing (12) in its simplest form for $r_{12.3}$,
we must have $r_{12.3}^{2} < 1$ or

\[ \frac{(r_{12} - r_{13}\,r_{23})^{2}}{(1 - r_{13}^{2})(1 - r_{23}^{2})} < 1, \]

that is,

\[ r_{12}^{2} + r_{13}^{2} + r_{23}^{2} - 2\,r_{12}\,r_{13}\,r_{23} < 1 \tag{18} \]

if the three r's are consistent with each other. If $r_{12}$ and $r_{13}$
are known, this gives as limits for $r_{23}$

\[ r_{12}\,r_{13} \pm \sqrt{1 - r_{12}^{2} - r_{13}^{2} + r_{12}^{2}\,r_{13}^{2}}. \]

Similarly, writing (17) in its simplest form for $r_{12}$, we must have

\[ r_{12.3}^{2} + r_{13.2}^{2} + r_{23.1}^{2} + 2\,r_{12.3}\,r_{13.2}\,r_{23.1} < 1, \tag{19} \]

and therefore, if $r_{12.3}$ and $r_{13.2}$ are given, $r_{23.1}$ must lie between
the limits

\[ -\,r_{12.3}\,r_{13.2} \pm \sqrt{1 - r_{12.3}^{2} - r_{13.2}^{2} + r_{12.3}^{2}\,r_{13.2}^{2}}. \]

The following table gives the limits of the third coefficient in
a few special cases, for the three coefficients of zero order and
of the first order respectively:



    Value of                                    Limits of
    $r_{12}$ or $r_{12.3}$    $r_{13}$ or $r_{13.2}$      $r_{23}$          $r_{23.1}$

    0                 0                   ±1              ±1
    ±1                ±1                  +1              -1
    ±1                ∓1                  -1              +1
    ±$\sqrt{0.5}$     ±$\sqrt{0.5}$       0, +1           0, -1
    ±$\sqrt{0.5}$     ∓$\sqrt{0.5}$       0, -1           0, +1




The student should notice that the set of three coefficients of
order zero and value unity are only consistent if either one only,
or all three, are positive, i.e. +1, +1, +1, or -1, -1, +1; but
not -1, -1, -1. On the other hand, the set of three coefficients
of the first order and value unity are only consistent if one only,
or all three, are negative: the only consistent sets are +1, +1,
-1 and -1, -1, -1. The values of the two given r's need to
be very high if even the sign of the third can be inferred; if the
two are equal, they must be at least equal to √0.5 or 0.707 ....
Finally, it may be noted that no two values for the known
coefficients ever permit an inference of the value zero for the
third; the fact that 1 and 2, 1 and 3 are uncorrelated, pair and
pair, permits no inference of any kind as to the correlation
between 2 and 3, which may lie anywhere between +1 and -1.
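
The limits discussed in this section can be checked numerically. The following is a minimal sketch only: the function name `third_coefficient_limits` and its `order` argument are illustrative assumptions, not part of the text. It evaluates the limits implied by inequality (18) for coefficients of zero order and by inequality (19) for coefficients of the first order.

```python
import math


def third_coefficient_limits(r_a, r_b, order=0):
    """Return (lower, upper) limits for the third coefficient.

    order=0: r_a, r_b stand for r12 and r13; the result bounds r23 (inequality 18).
    order=1: r_a, r_b stand for r12.3 and r13.2; the result bounds r23.1 (inequality 19).
    (Hypothetical helper written for illustration.)
    """
    # The centre of the interval changes sign between the two cases.
    centre = r_a * r_b if order == 0 else -r_a * r_b
    # 1 - a^2 - b^2 + a^2 b^2 factorises as (1 - a^2)(1 - b^2);
    # max() guards against a tiny negative value from rounding.
    half_width = math.sqrt(max(0.0, (1 - r_a ** 2) * (1 - r_b ** 2)))
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    root_half = math.sqrt(0.5)
    cases = [(0, 0), (1, 1), (1, -1), (root_half, root_half), (root_half, -root_half)]
    for a, b in cases:
        print(a, b, third_coefficient_limits(a, b, 0), third_coefficient_limits(a, b, 1))
```

For two equal given coefficients r of zero order the lower limit is 2r² - 1, which is not negative only when r² is at least 0.5; this is the condition stated above for inferring even the sign of the third coefficient.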

19. We do not think it necessary to add to this chapter a
detailed discussion of the nature of fallacies on which the theory
of multiple correlation throws much light. The general nature of
such fallacies is the same as for the case of attributes, and was
discussed fully in Chap. IV. §§ 1-8. It suffices to point out the
principal sources of fallacy which are suggested at once by the
form of the partial correlation

$$ r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \qquad (a) $$

and from the form of the corresponding expression for $r_{12}$ in terms
of the partial coefficients

$$ r_{12} = \frac{r_{12.3} + r_{13.2}\, r_{23.1}}{\sqrt{(1 - r_{13.2}^2)(1 - r_{23.1}^2)}} \qquad (b) $$

From the form of the numerator of (a) it is evident (1) that even
if $r_{12}$ be zero, $r_{12.3}$ will not be zero unless either $r_{13}$ or $r_{23}$, or
both, are zero. If $r_{13}$ and $r_{23}$ are of the same sign the partial
correlation will be negative; if of opposite sign, positive. Thus
the quantity of a crop might appear to be unaffected, say, by
the amount of rainfall during some period preceding harvest;
this might be due merely to a correlation between rain and low
temperature, the partial correlation between crop and rainfall
being positive and important. We may thus easily misinterpret
a coefficient of correlation which is zero. (2) $r_{12.3}$ may be, indeed
often is, of opposite sign to $r_{12}$, and this may lead to still more
serious errors of interpretation.

From the form of the numerator of (b), on the other hand, we
see that, conversely, $r_{12}$ will not be zero even though $r_{12.3}$ is zero,
unless either $r_{13.2}$ or $r_{23.1}$ is zero. This corresponds to the theorem
of Chap. IV. § 6, and indicates a source of fallacies similar to
those there discussed.
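
The two sources of fallacy may be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions: the function names `partial_r` and `total_r` are invented for the illustration, and the numerical values (a crop uncorrelated with rainfall, but correlated +0.5 with temperature, while rainfall and temperature are correlated -0.5) are hypothetical and are not data from the text.

```python
import math


def partial_r(r12, r13, r23):
    """First-order partial correlation r12.3 from the three zero-order coefficients."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


def total_r(r12_3, r13_2, r23_1):
    """Zero-order correlation r12 recovered from the three first-order partials."""
    return (r12_3 + r13_2 * r23_1) / math.sqrt((1 - r13_2 ** 2) * (1 - r23_1 ** 2))


if __name__ == "__main__":
    # Hypothetical values: 1 = crop, 2 = rainfall, 3 = temperature.
    r12, r13, r23 = 0.0, 0.5, -0.5
    r12_3 = partial_r(r12, r13, r23)   # crop and rainfall, temperature eliminated
    r13_2 = partial_r(r13, r12, r23)   # crop and temperature, rainfall eliminated
    r23_1 = partial_r(r23, r12, r13)   # rainfall and temperature, crop eliminated
    print("r12.3 =", round(r12_3, 4))
    # Working back from the partials returns the zero total correlation.
    print("r12 recovered =", round(total_r(r12_3, r13_2, r23_1), 4))
    # A case where the partial coefficient has the opposite sign to the total one.
    print("r12.3 =", round(partial_r(0.2, 0.6, 0.6), 4))
```

With these assumed values the total correlation between crop and rainfall is zero while the partial correlation is positive, and in the second case a total correlation of +0.2 corresponds to a negative partial correlation.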

20. We have seen (§ 9) that $r_{12.3}$ is the correlation between $x_{1.3}$
and $x_{2.3}$, and that we might determine the value of this partial
correlation by drawing up the actual correlation table for the two
residuals in question. Suppose, however, that instead of drawing
up a single table we drew up a series of tables for values of $x_{1.3}$
and $x_{2.3}$ associated with values of $x_3$ lying within successive
class-intervals of its range. In general the value of $r_{12.3}$ would
not be the same (or approximately the same) for all such tables,
but would exhibit some systematic change as the value of $x_3$
increased. Hence $r_{12.3}$ should be regarded, in general, as of the
nature of an average correlation: the cases in which it measures
the correlation between $x_{1.3}$ and $x_{2.3}$ for every value of $x_3$ (cf.
Chap. XVI.) are probably exceptional. The process for determining
partial associations (cf. Chap. IV.) is, it will be remembered,
thorough and complete, as we always obtain the actual tables
exhibiting the association between, say, A and B in the universe
of C's and the universe of γ's: that these two associations may
differ materially, is illustrated by Example i. of Chap. IV.
(pp. 45-6). It might sometimes serve as a useful check on
partial-correlation work to reclassify the observations by the
fundamental methods of that chapter.



REFERENCES.

The preceding chapter is written from the standpoint of refs. 3 and 4, and
with the notation and method of ref. 5. The theory of correlation for several
variables was developed by Edgeworth and Pearson (refs. 1 and 2) from the
standpoint of the "normal" distribution of frequency (cf. Chap. XVI.).

Theory.

(1) EDGEWORTH, F. Y., "On Correlated Averages," Phil. Mag., 5th Series,
    vol. xxxiv., 1892, p. 194.
(2) PEARSON, KARL, "Regression, Heredity, and Panmixia," Phil. Trans.
    Roy. Soc., Series A, vol. clxxxvii., 1896, p. 253.
(3) YULE, G. U., "On the Significance of Bravais' Formulae for Regression,
    etc., in the case of Skew Correlation," Proc. Roy. Soc., vol. lx., 1897,
    p. 477.
(4) YULE, G. U., "On the Theory of Correlation," Jour. Roy. Stat. Soc.,
    vol. lx., 1897, p. 812.
(5) YULE, G. U., "On the Theory of Correlation for any number of Variables
    treated by a New System of Notation," Proc. Roy. Soc., Series A, vol.
    lxxix., 1907, p. 182.
(6) HOOKER, R. H., and G. U. YULE, "Note on Estimating the Relative
    Influence of Two Variables upon a Third," Jour. Roy. Stat. Soc., vol.
    lxix., 1906, p. 197.

Illustrative Applications of Economic Interest.

(7) YULE, G. U., "An Investigation into the Causes of Changes in Pauperism
    in England, etc.," Jour. Roy. Stat. Soc., vol. lxii., 1899, p. 249.
(8) HOOKER, R. H., "The Correlation of the Weather and the Crops," Jour.
    Roy. Stat. Soc., vol. lxx., 1907, p. 1.

EXERCISES.

1. The following means, standard-deviations, and correlations are found for
$X_1$ = seed-hay crops in cwts. per acre,
$X_2$ = spring rainfall in inches,
$X_3$ = accumulated temperature above 42° F. in spring,
in a certain district of England during 20 years.

$M_1$ = 28.02    $\sigma_1$ = 4.42    $r_{12}$ = +0.80
$M_2$ = 4.91     $\sigma_2$ = 1.10    $r_{13}$ = -0.40
$M_3$ = 594      $\sigma_3$ = 85      $r_{23}$ = -0.56

Find the partial correlations and the regression-equation for hay-crop on spring
rainfall and accumulated temperature.

2. (The following figures must be taken as an illustration only: the data
on which they were based do not refer to uniform times or areas.)

$X_1$ = deaths of infants under 1 year per 1000 births in same year (infantile
mortality).
$X_2$ = proportion per thousand of married women occupied for gain.
$X_3$ = death-rate of persons over 5 years of age per 10,000.
$X_4$ = proportion per thousand of population living 2 or more to a room.

Taking the figures below for 30 urban areas in England and Wales, find the
partial correlations and the regression-equation for infantile mortality on the
other factors.

Mt^Ui oi= 20-0 r,j=+0-« ra=+016 
Jfg=l&8 ffi= 71-9 rij=-fO-78 rM=-0-a7 

J(,=H8 ff,= 22-4 r„= + 0-20 r„= +0'23 

Mt = 205 ffj = 130-0 

3. If all the correlations of order zero are equal, say = r, what are the values
of the partial correlations of successive orders?

Under the same condition, what is the limiting value of r if all the equal
correlations are negative and n variables have been observed?

4. What is the correlation between $x_{1.2}$ and $x_{2.1}$?

5. Write down from inspection the values of the partial correlations for the
three variables

$X_1$, $X_2$, and $X_3 = a.X_1 + b.X_2$.

Check the answer to Qu. 7, Chap. XI., by working out the partial
correlations.

6. If the relation

$a.x_1 + b.x_2 + c.x_3 = 0$

holds for all sets of values of $x_1$, $x_2$, and $x_3$, what must the partial correlations
be?

, Chap. XL, by working out tiie partial 



PART III.— THEORY OF SAMPLING. 

CHAPTER XIII. 

SIMPLE SAMPLING OF ATTRIBUTES.

1. The problem of the present Part. 2. The two chief divisions of the theory
of sampling. 3. Limitation of the discussion to the case of simple
sampling. 4. Definition of the chance of success or failure of a given
event. 5. Determination of the mean and standard-deviation of the
number of successes in n events. 6. The same for the proportion of
successes in n events: the standard-deviation of simple sampling as a
measure of unreliability, or its reciprocal as a measure of precision. 7.
Verification of the theoretical results by experiment. 8. More detailed
discussion of the assumptions on which the formula for the
standard-deviation of simple sampling is based. 9-10. Biological cases to
which the theory is directly applicable. 11. Standard-deviation of
simple sampling when the numbers of observations in the samples
vary. 12. Approximate value of the standard-deviation of simple
sampling, and relation between mean and standard-deviation, when
the chance of success or failure is very small. 13. Use of the
standard-deviation of simple sampling, or standard error, for checking and
controlling the interpretation of statistical results.

1. On several occasions in the preceding chapters it has been
pointed out that small differences between statistical measures like
percentages, averages, measures of dispersion and so forth cannot
in general be assumed to indicate the action of definite and
assignable causes. Small differences may easily arise from indefinite
and highly complex causation such as determines the fluctuating
proportions of heads and tails in tossing a coin, of black balls in
drawing samples from a bag containing a mixture of black and
white balls, or of cards bearing measurements within some given
class-interval in drawing cards, say, from an anthropometric record.
In 100 throws of a coin, for example, we may have noted 56 heads
and only 44 tails, but we cannot conclude that the coin is biassed:
on repeating our throws we may get only 48 heads and 52 tails.
Similarly, if on measuring the statures of 1000 men in each of
two nations we find that the mean stature is slightly greater for
nation A than for nation B, we cannot necessarily conclude that
the real mean stature is greater in the case of nation A: possibly
if the observations were repeated on different samples of 1000
men the ratio might be reversed.

2. The theory of such fluctuations may be termed the theory
of sampling, and there are two chief sections of the theory
corresponding to the theory of attributes and the theory of variables
respectively. In tossing a coin we only classify the results of the
tosses as heads or tails; in drawing balls from a mixture of black
and white balls, we only classify the balls drawn as black or as
white. These cases correspond to the theory of attributes, and
the general case may be represented as the drawing of a sample
from a universe containing both A's and α's, the number or
proportion of A's in successive samples being observed. If, on the
other hand, we put in a bag a number of cards bearing different
values of some variable X and draw sample batches of cards, we
can form averages and measures of dispersion for the successive
batches, and these averages and measures of dispersion will vary
slightly from one batch to another. If associated measures of
two variables X and Y are recorded on each card, we can also form
correlation-coefficients for the different batches, and these will vary
in a similar manner. These cases correspond to the theory of
variables, and it is the function of the theory of sampling for such
cases to inform us as to the fluctuations to be expected in the
averages, measures of dispersion, correlation-coefficients, etc., in
successive samples. In the present and the three following
chapters the theory of sampling is dealt with for the case of
attributes alone. The theory is of great importance and interest,
not only from its applications to the checking and control of
statistical results, but also from the theoretical forms of
frequency-distribution to which it leads. Finally, in Chapter XVII. one or
two of the more important cases of the theory of sampling for
variables are briefly treated, the greater part of the theory, owing
to its difficulty, lying somewhat outside the limits of this work.

3. The theory of sampling attains its greatest simplicity if
every observation contributed to the sample may be regarded as
independent of every other. This condition of independence
holds good, e.g., for the tossing of a coin or the throwing of a die;
the result of any one throw or toss does not affect, and is
unaffected by, the results of the preceding and following tosses.
It does not hold good, on the other hand, for the drawing of balls
from a bag: if a ball be drawn from a bag containing 3 black
and 3 white balls, the remainder may be either 2 black and 3
white, or 2 white and 3 black, according as the first ball was
black or white. The result of drawing a second ball is therefore
dependent on the result of drawing the first. The disturbance
can only be eliminated by drawing from a bag containing a
number of balls that is infinitely large compared with the
total number drawn, or by returning each ball to the bag before
drawing the next. In this chapter our attention will be confined
to the case of independent sampling, as in coin-tossing or
dice-throwing, the simplest cases of an artificial kind suitable for
theoretical study and experimental verification. For brevity, we
may refer to such cases of sampling as simple sampling: the
implied conditions are discussed more fully in § 8 below.
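
The dependence introduced by drawing without replacement can be made concrete with exact fractions. The sketch below is an illustration only; the function name `chance_second_black` is an assumption made for the example, and the bag of 3000 black and 3000 white balls is a hypothetical stand-in for a bag that is large compared with the number drawn.

```python
from fractions import Fraction


def chance_second_black(black, white, first_was_black, replace):
    """Chance that the second ball drawn is black, given the colour of the first.

    black, white: numbers of balls in the bag before the first drawing.
    replace: True if the first ball is returned to the bag before the second drawing.
    """
    if not replace:
        # The first ball stays out of the bag, so the composition changes.
        if first_was_black:
            black -= 1
        else:
            white -= 1
    return Fraction(black, black + white)


if __name__ == "__main__":
    # Bag of 3 black and 3 white balls, as in the text.
    print(chance_second_black(3, 3, True, replace=False))   # first ball black
    print(chance_second_black(3, 3, False, replace=False))  # first ball white
    print(chance_second_black(3, 3, True, replace=True))    # ball returned
    # A large bag: the disturbance becomes very small.
    print(float(chance_second_black(3000, 3000, True, replace=False)))
```

Without replacement the chance of black at the second drawing is 2/5 or 3/5 according to the first result; with replacement it is 1/2 whatever the first result, which is the condition of independence required for simple sampling.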

4. If we may regard an ideal coin as a uniform, homogeneous
circular disc, there is nothing which can make it tend to fall more
often on the one side than on the other; we may expect,
therefore, that in any long series of throws the coin will fall with
either face uppermost an approximately equal number of times,
or with, say, heads uppermost approximately half the times.
Similarly, if we may regard the ideal die as a perfect homogeneous
cube, it will tend, in any long series of throws, to fall with each
of its six faces uppermost an approximately equal number of
times, or with any given face uppermost one-sixth of the whole
number of times. These results are sometimes expressed by
saying that the chance of throwing heads (or tails) with a coin is
1/2, and the chance of throwing six (or any other face) with a die
is 1/6. To avoid speaking of such particular instances as coins
or dice, we shall in future, using terms which have become
conventional, refer to an event the chance of success of which is p
and the chance of failure q. Obviously p + q = 1.

5. Suppose we take N samples with n events in each. What
will be the values towards which the mean and standard-deviation
of the number of successes in a sample will tend? The mean is
given at once, for there are N.n events, of which approximately
pNn will be successes, and the mean number of successes in a
sample will therefore tend towards pn. As regards the
standard-deviation, consider first the single event (n = 1). The single
event may give either no successes or one success, and will tend
to give the former qN, the latter pN, times in N trials. Take
this frequency distribution and work out the standard-deviation
of the number of successes for the single event, as in the case of
an arithmetical example:

   Successes.   Frequency f.   f × successes.   f × (successes)².
       0            qN               0                 0
       1            pN              pN                pN
     Total           N              pN                pN

We have therefore M = p, and

$$ \sigma_1^2 = p - p^2 = pq. $$

But the number of successes in a group of n such events is the
sum of successes for the single events of which it is composed,
and, all the events being independent, we have therefore, by the
usual rule for the standard-deviation of the sum of independent
variables (Chap. XI. § 2, equation (2)), $\sigma_n$ being the
standard-deviation of the number of successes in n events,

$$ \sigma_n^2 = npq \qquad (1) $$

This is an equation of fundamental importance in the theory of
sampling. The student should particularly bear in mind that
the standard-deviation of the number of successes, due to
fluctuations of simple sampling alone, in a group of n events
varies, not directly as n, but as the square root of n.
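
Equation (1) can be checked independently of the argument above by enumerating the chances of 0, 1, ...., n successes and working out the mean and standard-deviation directly. The sketch below assumes that these chances are the terms of the binomial expansion of (q + p)^n, a result not used in the derivation above; the function names are illustrative only.

```python
import math


def success_chances(n, p):
    """Chances of 0, 1, ..., n successes in n independent events (binomial terms)."""
    q = 1 - p
    return [math.comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]


def mean_and_sd(n, p):
    """Mean and standard-deviation of the number of successes, by direct summation."""
    chances = success_chances(n, p)
    mean = sum(k * w for k, w in enumerate(chances))
    variance = sum((k - mean) ** 2 * w for k, w in enumerate(chances))
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    for n, p in [(1, 0.5), (12, 0.5), (12, 1 / 6), (48, 0.5)]:
        mean, sd = mean_and_sd(n, p)
        # Compare with the values pn and sqrt(npq) given by the text.
        print(n, round(p, 4), round(mean, 4), round(sd, 4),
              round(n * p, 4), round(math.sqrt(n * p * (1 - p)), 4))
```

Quadrupling n from 12 to 48 doubles the standard-deviation of the number of successes, in accordance with the square-root rule.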

6. In lieu of recording the absolute number of successes in each
sample of n events, we might have recorded the proportion of
such successes, i.e. 1/nth of the number in each sample. As this
would amount to merely dividing all the figures of the original
record by n, the mean proportion of successes (or rather the value
towards which the mean tends to approach) must be p, and the
standard-deviation of the proportion of successes $s_n$ be given by

$$ s_n^2 = \sigma_n^2 / n^2 = pq/n \qquad (2) $$

The standard-deviation of the proportion of successes in samples
of such independent events varies therefore inversely as the square
root of the number on which the proportion is calculated. Now
if we regard the observed proportion in any one sample as a
more or less unreliable determination of the true proportion in
a very large sample from the same material, the standard-deviation
of sampling may fairly be taken as a measure of the
unreliability of the determination: the greater the
standard-deviation, the greater the fluctuations of the observed proportion,
although the true proportion is the same throughout. The
reciprocal of the standard-deviation (1/s), on the other hand, may
be regarded as a measure of reliability, or, as it is sometimes
termed, precision, and consequently the reliability or precision of
an observed proportion varies as the square root of the number of
observations on which it is based. This is again a very important
rule with many practical applications, but the limitations of the
case to which it applies, and the exact conditions from which it
has been deduced, should be borne in mind. We return to this
point again below (§ 8 and Chap. XIV.).
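
The rule that precision varies as the square root of the number of observations may be seen in a short calculation. This is a minimal sketch of equation (2); the function names `sd_of_proportion` and `precision` are illustrative assumptions.

```python
import math


def sd_of_proportion(p, n):
    """Standard-deviation of the proportion of successes in n events, equation (2)."""
    return math.sqrt(p * (1 - p) / n)


def precision(p, n):
    """Reciprocal of the standard-deviation, taken as the measure of precision."""
    return 1 / sd_of_proportion(p, n)


if __name__ == "__main__":
    for n in (100, 400, 1600):
        print(n, sd_of_proportion(0.5, n), precision(0.5, n))
```

To double the precision of an observed proportion the number of observations must therefore be multiplied by four, not by two.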

7. Experiments in coin tossing, dice throwing, and so forth
have been carried out by various persons in order to obtain
experimental verification of these results. The following will serve
as illustrations, but the student is strongly recommended to
carry out a few series of such experiments personally, in order to
acquire confidence in the use of the theory. It may be as well
to remark that if ordinary commercial dice are to be used for the
trials, care should be taken to see that they are fairly true cubes,
and the marks not cut very deeply. Cheap dice are generally
very much out of truth, and if the marks are deeply cut the
balance of the die may be sensibly affected. A convenient mode
of throwing a number of dice, suggested, we believe, by the late
Professor Weldon, is to roll them down an inclined gutter of
corrugated paper, so that they roll across the corrugations.

(1) (W. F. R. Weldon, cited by Professor F. Y. Edgeworth,
Encycl. Brit., 10th edn., vol. xxviii. p. 282. Totals of the columns
in the table there given.)

Twelve dice were thrown 4096 times; a throw of 4, 5, or 6 points
reckoned a success, therefore p = q = 0.5. Theoretical mean M = 6;
theoretical value of the standard-deviation
$\sigma = \sqrt{0.5 \times 0.5 \times 12} = 1.732$.

The following was the frequency-distribution observed:

   Successes.   Frequency.        Successes.   Frequency.
       0             0                 7           847
       1             7                 8           536
       2            60                 9           257
       3           198                10            71
       4           430                11            11
       5           731                12             0
       6           948
                                     Total        4096



Mean M = 6.139, standard-deviation σ = 1.712. The proportion of
successes is 6.139/12 = 0.512 instead of 0.5.

(2) (W. F. R. Weldon, loc. cit., p. 289. Totals of columns of
the table given.)

Twelve dice were thrown 4096 times; only a throw of 6 was
counted a success, so p = 1/6, q = 5/6. Theoretical mean M = 2,
standard-deviation $\sigma = \sqrt{1/6 \times 5/6 \times 12} = 1.291$.

The following was the observed frequency-distribution:

   Successes.   Frequency.        Successes.   Frequency.
       0           447                 5           115
       1          1145                 6            24
       2          1181                 7             7
       3           796                 8             1
       4           380
                                     Total        4096



Mean M = 2.000, standard-deviation σ = 1.296. Actual proportion
of successes 2.00/12 = 0.1667, agreeing with the theoretical value
to the fourth place of decimals. Of course such very close
agreement is accidental, and not to be always expected.

(3) (G. U. Yule.) The following may be taken as an illustration
based on a smaller number of observations. Three dice were
thrown 648 times, and the numbers of 5's or 6's noted at
each throw. p = 1/3, q = 2/3. Theoretical mean 1. Standard-deviation,
0.816.

Frequency-distribution observed:

   Successes.   Frequency.
       0           179
       1           298
       2           141
       3            30
     Total         648

M = 1.034, σ = 0.823. Actual proportion of successes 0.345.

For other illustrations, some of which are cited in the questions
at the end of this chapter, the student may be referred to the
list of references on p. 269. The student should notice that in
all the distributions given a range of six times the
standard-deviation includes either all, or the great bulk of, the observations,
as in most frequency-distributions of the same general form. We
shall make use of this rule below, § 13.
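
The student who has no dice at hand may imitate such experiments by machine. The following is a sketch only, assuming that the pseudo-random generator of Python's `random` module is an adequate substitute for true dice; the function names and the choice of seed are illustrative. It works out the theoretical mean and standard-deviation for the three experiments above and repeats the first of them artificially.

```python
import math
import random


def theoretical(n_dice, p):
    """Theoretical mean np and standard-deviation sqrt(npq) of the number of successes."""
    return n_dice * p, math.sqrt(n_dice * p * (1 - p))


def simulate(n_dice, faces_counted, throws, seed=0):
    """Throw n_dice artificial dice `throws` times; return mean and s.d. of successes."""
    rng = random.Random(seed)  # fixed seed so that the run can be repeated
    counts = [
        sum(rng.randint(1, 6) in faces_counted for _ in range(n_dice))
        for _ in range(throws)
    ]
    mean = sum(counts) / throws
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / throws)
    return mean, sd


if __name__ == "__main__":
    print(theoretical(12, 1 / 2))   # experiment (1): 4, 5 or 6 a success
    print(theoretical(12, 1 / 6))   # experiment (2): 6 only a success
    print(theoretical(3, 1 / 3))    # experiment (3): 5 or 6 a success
    print(simulate(12, {4, 5, 6}, 4096, seed=1))
```

The simulated mean and standard-deviation will differ a little from the theoretical values and from one seed to another, just as the observed values above differ from theory.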

8. In deducing the formulae (1) and (2) for the
standard-deviations of simple sampling in the cases with which we have
been dealing, only one condition has been explicitly laid down as
necessary, viz. the independence of the several drawings, tossings,
or other events composing the sample. But in point of fact this
is not the only nor the most fundamental condition which has
been explicitly or implicitly assumed, and it is necessary to realise
all the conditions in order to grasp the limitations under which
alone the formulae arrived at will hold. Supposing, for example,
that we observe among groups of 1000 persons, at different times
or in different localities, various percentages of individuals
possessing certain characteristics: dark hair, or blindness, or
insanity, and so forth. Under what conditions should we
expect the observed percentages to obey the law of sampling
that we have found, and show a standard-deviation given by
equation (2)?

(a) In the first place we have tacitly assumed throughout the
preceding work that our dice or our coins were the same set or
identically similar throughout the experiment, so that the chance
of throwing "heads" with the coins or, say, "six" with the dice
was the same throughout: we did not commence an experiment
with dice loaded in one way and later on take a fresh set of dice
loaded in another way. Consequently if formula (2) is to hold
good in our practical case of sampling there must not be a
difference in any essential respect (i.e. in any character that can
affect the proportion observed) between the localities from which
the observations are drawn, nor, if the observations have been
made at different epochs, must any essential change have taken
place during the period over which the observations are spread.
Where the causation of the character observed is more or less
unknown, it may, of course, be difficult or impossible to say what
differences or changes are to be regarded as essential, but, where
we have more knowledge, the condition laid down enables us to
exclude certain cases at once from the possible applications of
formula (1) or (2). Thus it is obvious that the theory of simple
sampling cannot apply to the variations of the death-rate in
localities with populations of different age and sex compositions,
nor to death-rates in a mixture of healthy and unhealthy districts,
nor to death-rates in successive years during a period of
continuously improving sanitation. In all such cases variations
due to definite causes are superposed on the fluctuations of
sampling.

(b) In the second place, we have also tacitly assumed not
only that we were using the same set of coins or dice throughout,
so that the chances p and q were the same at every trial, but
also that all the coins and dice in the set used were identically
similar, so that the chances p and q were the same for every coin
or die. Consequently, if our formulae are to apply in the practical
case of sampling, the conditions that regulate the appearance of
the character observed must not only be the same for every
sample, but also for every individual in every sample. This is
again a very marked limitation. To revert to the case of
death-rates, formulae (1) and (2) would not apply to the numbers of
persons dying in a series of samples of 1000 persons, even if these
samples were all of the same age and sex composition, and living
under the same sanitary conditions, unless, further, each sample
only contained persons of one sex and one age. For if each
sample included persons of both sexes and different ages, the
condition would be broken, the chance of death during a given
period not being the same for the two sexes, nor for the young
and the old. The groups would not be homogeneous in the sense
required by the conditions from which our formulae have been
deduced. Similarly, if we were observing hair-colours, our formulae
would not apply if the samples were compounded by always
taking one person from district A, another from district B, and
so on, these districts not being similar as regards the distribution
of hair-colour.

The above conditions were only tacitly assumed in our previous
work, and consequently it has been necessary to emphasise them
specially. The third condition was explicitly stated: (c) The
individual "events," or appearances of the character observed,
must be completely independent of one another, like the throws
of a die, or sensibly so, like the drawings of balls from a bag
containing a number of balls that is very large compared with
the number drawn. Reverting to the illustration of a death-rate,
our formulae would not apply even if the sample populations
were composed of persons of one age and one sex, if we were
dealing, for example, with deaths from an infectious or contagious
disease. For if one person in a certain sample has contracted
the disease in question, he has increased the possibility of others
doing so, and hence of dying from the disease. The same thing
holds good for certain classes of deaths from accident, e.g. railway
accidents due to derailment, and explosions in mines: if such an
accident is fatal to one person it is probably fatal to others also,
and consequently the annual returns show large and more or
less erratic variations.

When we speak of simple sampling in the following pages, the
term is intended to imply the fulfilment of all the conditions (a),
(b), and (c), all the samples and all the individual contributions to
each sample being taken under precisely the same conditions,
and the individual "events" or appearances of the character being
quite independent. It may be as well expressly to note that we
need not make any assumption as to the conditions that determine
p unless we have to estimate $\sqrt{npq}$ a priori. If we draw a
sample and observe in it the actual proportion of, say, A's:
draw another sample under precisely the same conditions, and
observe the proportion of A's in the two samples together; add
to these a third sample, and so on, we will find that p approaches
(not continuously, but with some fluctuations) closer and closer
to some limiting value. It is this limiting value which is to be
used in our formulae: the value of p that would be observed in
a very large sample. The standard-deviation of the number of
sixes thrown with n dice, on this understanding, may be $\sqrt{npq}$,
even if the dice be out of truth or loaded so that p is no longer
1/6. Similarly, the standard-deviation of the number of black
balls in samples of n drawn from an infinitely large mixture of
black and white balls in equal proportions may be $\sqrt{npq}$ even
if p is, say, 1/3, and not 1/2 owing to the black balls, for some
reason, tending to slip through our fingers. (Cf. Chap. XIV.)

9. It is evident that these conditions very much limit the
field of practical cases of an economic or sociological character
to which formulae (1) and (2) can apply without considerable
modification. The formulae appear, however, to hold to a high
degree of approximation in certain biological cases, notably in
the proportions of offspring of different types obtained on crossing
hybrids, and, with some limitations, to the proportions of the
two sexes at birth. It is possible, accordingly, that in these cases
all the necessary conditions are fulfilled, but this is not a necessary
inference from the mere applicability of the formulae (cf. Chap.
XIV. § 15). In the case of the sex-ratio at birth, it seems
doubtful whether the rule applies to the frequency of the sexes in
individual families of given numbers (ref. 9), but it does apply
fairly closely to the sex-ratios of births in different localities,
and still more closely to the ratios in one locality during
successive periods. That is to say, if we note the number of
males in a series of groups of n births each, the standard-deviation
of that number is approximately $\sqrt{npq}$, where p is the chance
of a male birth; or, otherwise, $\sqrt{pq/n}$ is the standard-deviation
of the proportion of male births. We are not able to assign an
a priori value to the chance p as in the case of dice-throwing,
but it is quite sufficiently accurate for practical purposes to use
the proportion of male births actually observed if that proportion
be based on a moderately large number of observations.

10. In Table VI. of Chap. IX. (p. 163) was given a
correlation-table between the total numbers of births in the registration districts
of England and Wales during the decade 1881-90 and the
proportion of male births. The table below gives some similar figures,
based on the same data, for a few isolated groups of districts
containing not less than 30 to 40 districts each. In both tables the
drop in dispersion as we pass from the small to the large districts
is extremely striking. The actual standard-deviations, and the
standard-deviations of simple sampling corresponding to the
mid-numbers of births, are given at the foot of the table, and it will
be seen that the two agree, on the whole, with surprising closeness,
considering the small numbers of observations. The actual
standard-deviation is, however, the larger of the two in every case
but one. The corresponding standard-deviations for Table VI. of
Chap. IX. are given in Qu. 1 at the end of this chapter, and show
the same general agreement with the standard-deviations of simple
sampling; the actual standard-deviations are, however, again, as
a rule, slightly in excess of the theoretical values.



Table showing Frequencies of Registration Districts in England and Wales with
Different Ratios of Male to Total Births during the Decade 1881-90, for
Groups of Districts with the Numbers of Births in the Decade lying between
Certain Limits. [Data based on Decennial Supplement to Fifty-fifth Annual
Report of the Registrar-General for England and Wales.]

                                      Number of Births in Decade.
                          1500    3500    4500   10,000  15,000  30,000  50,000
                           to      to      to      to      to      to      to
                          2500    4000    5000   15,000  20,000  50,000  90,000

[The body of the table, giving the frequency of districts in each class of
male births per thousand total births, is not legible in this copy.]

Total                      36      38      40      73      33      43      36
Mean                     508.2   509.5   510.2   510.6   510.3   509.0   507.8
Standard-deviation        12.8    8.53    7.12    4.98    3.87    3.22    2.20
Theo. st. deviation
  corresponding to
  mid-number of births    11.2    8.18    7.26    4.47    3.78    2.50    1.89
(expression illegible)*    8.2     2.6     ...     2.2     0.8     2.0     1.1

* The meaning of this expression is explained in § 10 of Chap. XIV.



The student should note that in both cases the standard-deviations
given are standard-deviations of the proportion of male
births per 1000 of all births, that is, 1000 times the values given
by equation (2). These values are given by simply substituting
the proportions per 1000 for p and q in the formula. Thus for
the first column of Table I. the proportion of males is 508 per
1000 births, the mid-number of births 2000, and therefore

$$s = \left(\frac{508 \times 492}{2000}\right)^{1/2} = 11.2.$$
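The arithmetic of this substitution can be checked with a minimal sketch. The function name `sd_per_thousand` is an illustrative assumption, not anything from the text; it simply evaluates equation (2) with the proportions expressed per 1000.

```python
import math


def sd_per_thousand(males_per_1000: float, births: float) -> float:
    """Standard-deviation of simple sampling, in male births per 1000 births.

    Equation (2) with p and q expressed per 1000 rather than as fractions.
    """
    females_per_1000 = 1000.0 - males_per_1000
    return math.sqrt(males_per_1000 * females_per_1000 / births)


# First column of the table: 508 males per 1000, mid-number of births 2000.
first_column = sd_per_thousand(508, 2000)
print(round(first_column, 1))  # 11.2
```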



11. In the above illustration the difficulty due to the wide
variation in the number of births n in different districts has been
surmounted by grouping these districts in limited class intervals,
and assuming that it would be sufficiently accurate for practical
purposes to treat all the districts in one class as if the sex-ratios
had been based on the mid-numbers of births. Given a sufficiently
large number of observations, such a process does well enough,
though it is not very good. But if the number of observations
does not exceed, perhaps, 50 or 60 altogether, grouping is
obviously out of the question, and some other procedure must be
adopted.

Suppose, then, that a series of samples have been taken from
the same material, $f_1$ samples containing $n_1$ individuals or
observations each, $f_2$ containing $n_2$, $f_3$ containing $n_3$, and
so on: What would be the standard-deviation of the observed
proportions in these samples? Evidently the square of the
standard-deviation in the first group would be $pq/n_1$, in the second
$pq/n_2$, and so on: therefore, as the means tend to the same values in
all the groups, we must have for the whole series, N being the total
number of samples,

$$N s^2 = pq\left(\frac{f_1}{n_1} + \frac{f_2}{n_2} + \frac{f_3}{n_3} + \dots\right).$$

But if H be the harmonic mean of $n_1$, $n_2$, $n_3$, . . . .

$$\frac{1}{H} = \frac{1}{N}\left(\frac{f_1}{n_1} + \frac{f_2}{n_2} + \frac{f_3}{n_3} + \dots\right),$$

and accordingly

$$s^2 = \frac{pq}{H} \qquad (3)$$

That is to say, where the number of observations varies from one
sample to another, the harmonic mean number of observations in
a sample must be substituted for n in equation (2).

Thus the following percentages (taken to the nearest unit) of
albinos were obtained in 121 litters from hybrids of Japanese
waltzing mice by albinos, crossed inter se (A. D. Darbishire,
Biometrika, iii. p. 30):

Peroentage. Frequenoy. 



U 



33 



40 



13 



Percentage. Frequency. 
40 3 



43 



100 



The distribution is very irregular owing to the small numbers in
the litters, and the standard-deviation is 23.09 per cent. The
numbers of litters of different sizes were given in § 27 of Chap.
VII. p. 128, and the harmonic mean size of litter was found to be
3.53. The expected proportion of albinos is 25 per cent., and
hence the standard-deviation of sampling is

$$\left(\frac{25 \times 75}{3.53}\right)^{1/2} = 23.05,$$

in very close agreement with the actual value. The proportion
of albinos amongst all the offspring together was 24.7 per cent.
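The harmonic-mean rule of equation (3) can be sketched as follows. The helper names are illustrative assumptions, and the small set of litter sizes in the last lines is made up purely to exercise `harmonic_mean`; only the figures 25 per cent. and 3.53 come from the text.

```python
import math


def harmonic_mean(sizes, freqs):
    """Harmonic mean of sample sizes, each size n occurring f times."""
    total = sum(freqs)
    return total / sum(f / n for n, f in zip(sizes, freqs))


def sd_of_proportion(p: float, n: float) -> float:
    """Standard-deviation of simple sampling of a proportion, sqrt(pq/n)."""
    return math.sqrt(p * (1.0 - p) / n)


# Figures quoted in the text: expectation 25 per cent., harmonic mean 3.53.
albino_sd_percent = 100.0 * sd_of_proportion(0.25, 3.53)
print(round(albino_sd_percent, 2))  # 23.05

# Hypothetical litter sizes (NOT Darbishire's data): 1 litter of 2,
# 2 litters of 4, 1 litter of 8.
h = harmonic_mean([2, 4, 8], [1, 2, 1])
print(round(h, 4), round(100.0 * sd_of_proportion(0.25, h), 2))
```

Note that the harmonic mean is pulled towards the small samples, which is why it, rather than the arithmetic mean, gives the right average of the terms pq/n.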

12. If one of the two proportions p and q become very small,
equation (1) may be put into an approximate form that is very
useful. Suppose p to be the proportion that becomes very small,
so that we may neglect $p^2$ compared with p: then

$$pq = p - p^2 = p \text{ approximately},$$

and consequently we have approximately

$$\sigma = \sqrt{np} = \sqrt{M} \qquad (4)$$

That is to say, if the proportion of successes be small, the
standard-deviation of the number of successes is the square root of
the mean number of successes. Hence we can find the standard-
deviation of sampling even though p be unknown, provided only
we know that it is small.

Thus (ref. 14) in 10 Prussian army corps in 20 years (1875-
1894) there were 122 men killed by the kick of a horse, or, on an
average, there were 0.61 deaths from that cause in each army
corps annually. From equation (4) we accordingly have for the
standard-deviation of simple sampling

$$\sigma = (0.61)^{1/2} = 0.78.$$

The frequency-distribution of the number of deaths per army
corps per annum was

Deaths.     Frequency.
   0           109
   1            65
   2            22
   3             3
   4             1

giving

$$\sigma^2 = 0.6079, \qquad \sigma = 0.78,$$

an almost exact agreement with the standard-deviation of simple
sampling.
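The comparison can be reproduced with a short sketch. It assumes the frequencies 109, 65, 22, 3, 1 for 0 to 4 deaths, which are the figures usually quoted for these 200 corps-years and which reproduce both the 122 deaths and the variance 0.6079 stated in the text; the variable names are illustrative only.

```python
import math

# Assumed frequency table: number of deaths -> number of corps-years.
frequency = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}

corps_years = sum(frequency.values())                    # 200
deaths = sum(k * f for k, f in frequency.items())        # 122
mean = deaths / corps_years                              # 0.61
variance = sum(f * (k - mean) ** 2 for k, f in frequency.items()) / corps_years

actual_sd = math.sqrt(variance)       # from the observed distribution
theoretical_sd = math.sqrt(mean)      # equation (4): square root of the mean

print(corps_years, deaths, round(mean, 2))
print(round(variance, 4), round(actual_sd, 2), round(theoretical_sd, 2))
```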

13. We may now turn from these verifications of the theoretical
results for various special cases, to the use of the formulae for
checking and controlling the interpretation of statistical results.
If we observe, in a statistical sample, a certain proportion of
objects or individuals possessing some given character, say A's,
this proportion differing more or less from the proportion which
for some reason we expected, the question always arises whether
the difference may be due to the fluctuations of simple sampling
only, or may be indicative of definite differences between the
conditions in the universe from which the sample has been drawn
and the assumed conditions on which we based our expectation.
Similarly, if we observe a different proportion in one sample from
that which we have observed in another, the question again arises
whether this difference may be due to fluctuations of simple
sampling alone, or whether it indicates a difference between the
conditions subsisting in the universes from which the two samples
were drawn: in the latter case the difference is often said to be
significant. These questions can be answered, though only more
or less roughly at present, by comparing the observed difference
with the standard-deviation of simple sampling. We know
roughly that the great bulk at least of the fluctuations of sampling
lie within a range of ± three times the standard-deviation;
and if an observed difference from a theoretical result greatly
exceeds these limits it cannot be ascribed to a fluctuation of
"simple sampling" as defined in § 8: it may therefore be
significant. The "standard-deviation of simple sampling" being the
basis of all such work, it is convenient to refer to it by a shorter
name. The observed proportions of A's in given samples being
regarded as differing by larger or smaller errors from the true
proportion in a very large sample from the same material, the
"standard-deviation of simple sampling" may be regarded as a
measure of the magnitude of such errors, and may be called
accordingly the standard error.

Three principal cases of comparison may be distinguished.

Case I. It is desired to know whether the deviation of a certain
observed number or proportion from an expected theoretical value
is possibly due to errors of sampling.

In this case the observed difference is to be compared with the
standard error of the theoretical number or proportion, for the
number of observations contained in the sample.

Example i. In the first illustration of § 7, 25,145 throws of a 4,
5, or 6 were made in lieu of the 24,576 expected (out of 49,152
throws altogether). The excess is 569 throws. Is this excess
possibly due to mere fluctuations of sampling?

The standard error is

$$\sigma = \sqrt{\tfrac{1}{2} \times \tfrac{1}{2} \times 49152} = 110.9.$$

The deviation observed is 5.1 times the standard error, and,
practically speaking, could not occur as a fluctuation of simple
sampling. It may perhaps indicate a slight bias in the dice.

The problem might, of course, have been attacked equally well
from the standpoint of the proportion in lieu of the absolute
number of 4's, 5's, or 6's thrown. This proportion is 0.5116 instead
of the theoretical 0.5000, difference in excess 0.0116. The
standard error of the proportion is

$$s = \sqrt{\tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{1}{49152}} = 0.00226,$$

and the difference observed bears the same ratio to the standard
error as before, as of course it must.

Example ii. (Data from the Second Report of the Evolution
Committee of the Royal Society, 1905, p. 72.)

Certain crosses of Pisum sativum gave 5321 yellow and 1804
green seeds. The expectation is 25 per cent. of green seeds, or
1781. Can the divergence from the exact theoretical result have
arisen owing to errors of sampling only?

The numerical difference from the expected result is 23. The
standard error is

$$\sigma = \sqrt{0.25 \times 0.75 \times 7125} = 36.6.$$

Hence the divergence from theory is only some 3/5 of the
standard error, and may very well have arisen owing simply to
fluctuations of sampling.

Working from the observed proportion of green seeds, viz. 0.2532
instead of the theoretical 0.25, we have

$$s = \sqrt{0.25 \times 0.75/7125} = 0.0051,$$

and similarly the divergence from theory is only some 3/5 of the
standard error, as before.
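The two worked examples of Case I can be sketched in a few lines. The function names are illustrative assumptions; the code only evaluates the standard errors sqrt(npq) and sqrt(pq/n) and expresses each observed deviation in units of its standard error.

```python
import math


def se_count(p: float, n: int) -> float:
    """Standard error of the number of successes in n events: sqrt(npq)."""
    return math.sqrt(n * p * (1.0 - p))


def se_proportion(p: float, n: int) -> float:
    """Standard error of the proportion of successes: sqrt(pq/n)."""
    return math.sqrt(p * (1.0 - p) / n)


def deviation_in_standard_errors(observed: int, p: float, n: int) -> float:
    """Observed minus expected number, in units of the standard error."""
    return (observed - n * p) / se_count(p, n)


# Example i: 25,145 successes in 49,152 throws, expectation one-half.
dice = deviation_in_standard_errors(25145, 0.5, 49152)
# Example ii: 1804 green seeds out of 7125, expectation one-quarter.
peas = deviation_in_standard_errors(1804, 0.25, 7125)

# Working with proportions instead of counts must give the same ratio.
dice_by_proportion = (25145 / 49152 - 0.5) / se_proportion(0.5, 49152)

print(f"dice: {dice:.2f} standard errors")
print(f"peas: {peas:.2f} standard errors")
```

A ratio well above 3, as for the dice, lies outside the range the text treats as attributable to simple sampling; a ratio below 1, as for the peas, lies comfortably inside it.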

It should be noted that this method must not be used as a test
of association by comparing the difference of (AB) from (A)(B)/N
with a standard error calculated from the latter value as a
"theoretical number," for it is not a theoretical number given
a priori as in the above illustrations, and (A) and (B) are themselves
liable to errors of sampling. If we formed an association-table
between the results of tossing two coins N times,
$\sigma = \sqrt{N \cdot \tfrac{1}{4} \cdot \tfrac{3}{4}}$
would be the standard error for the divergence of (AB) from the
a priori value N/4, not the standard error for differences of (AB)
from (A)(B)/N, (A) and (B) being the numbers of heads thrown
in the case of the first and the second coin respectively.

Case II. Two samples from distinct materials or different
universes give proportions of A's $p_1$ and $p_2$, the numbers of
observations in the samples being $n_1$ and $n_2$ respectively. (a) Can
the difference between the two proportions have arisen merely as a
fluctuation of simple sampling, the two universes being really
similar as regards the proportion of A's therein? (b) If the
difference indicated were a real one, might it vanish, owing to
fluctuations of sampling, in other samples taken in precisely the
same way? This case corresponds to the testing of an association
which is indicated by a comparison of the proportion of A's amongst
B's and β's.

(a) We have no theoretical expectation in this case as to the
proportion of A's in the universe from which either sample has
been taken.

Let us find, however, whether the observed difference between $p_1$
and $p_2$ may not have arisen solely as a fluctuation of simple
sampling, the proportion of A's being really the same in both cases,
and given, let us say, by the (weighted) mean proportion in our
two samples together, i.e. by

$$p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$

(the best guide that we have).

Let $\epsilon_1$, $\epsilon_2$ be the standard errors in the two samples, then

$$\epsilon_1^2 = p_0 q_0/n_1, \qquad \epsilon_2^2 = p_0 q_0/n_2.$$

If the samples are simple samples in the sense of the previous
work, then the mean difference between $p_1$ and $p_2$ will be zero,
and the standard error of the difference $\epsilon_{12}$, the samples being
independent, will be given by

$$\epsilon_{12}^2 = p_0 q_0\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \qquad (5)$$

If the observed difference is less than some three times $\epsilon_{12}$ it
may have arisen as a fluctuation of simple sampling only.

(b) If, on the other hand, the proportions of A's are not the same
in the material from which the two samples are drawn, but $p_1$ and
$p_2$ are the true values of the proportions, the standard errors of
sampling in the two cases are

$$\epsilon_1^2 = p_1 q_1/n_1, \qquad \epsilon_2^2 = p_2 q_2/n_2,$$

and consequently

$$\epsilon_{12}^2 = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2} \qquad (6)$$

If the difference between $p_1$ and $p_2$ does not exceed some three
times this value of $\epsilon_{12}$, it may be obliterated by an error of simple
sampling on taking fresh samples in the same way from the same
material.

Further, the student should note that the value of $\epsilon_{12}$ given by
equation (6) is frequently employed, in lieu of that given by
equation (5), for testing the significance of an observed difference.
The justification of this usage we indicate briefly later (Chap.
XIV. § 3). Here it is sufficient to state that, if n be large,
equation (6) gives approximately the standard-deviation of the
true values of the difference for a given observed value, and hence,
if the observed difference is greater than some three times
the value of $\epsilon_{12}$ given by (6), it is hardly possible that the true
value of the difference can be zero. The difference between the
values of $\epsilon_{12}$ given by (5) and (6) is indeed, as a rule, of more
theoretical than practical importance, for they do not differ largely
unless $p_1$ and $p_2$ differ largely, and in that case either formula will
place the difference outside the range of fluctuations of sampling.

Example iii. The following data were given in Qu. 3 of Chap.
III. for plants of Lobelia fulgens obtained by cross- and self-fertilisation
respectively:

   Parentage Cross-fertilised.          Parentage Self-fertilised.
            Height                               Height
 Above Average.  Below Average.     Above Average.  Below Average.
       17              17                 12              22

The figures indicate an association between tallness and
cross-fertilisation of parentage. Is this association significant of some
real difference, or may it have arisen solely as an "error of
sampling"? The proportion of plants above average height in the
two classes (cross- and self-fertilised) together is 29/68. The
standard-deviation of the differences due to simple sampling
between the proportions of "tall" plants in two samples of 34
observations each is therefore

$$\epsilon_{12} = \left(\frac{29}{68} \times \frac{39}{68} \times \frac{2}{34}\right)^{1/2} = 0.120,$$

or 12.0 per cent. The actual proportions observed are 50 per
cent. and 35 per cent., difference 15 per cent. As this difference
is only slightly in excess of the standard error of the difference,
for samples of 34 observations drawn from identical material, no
definite significance could be attached to it, if it stood alone.

The student will notice, however, that all the other cases cited
from Darwin in the question referred to show an association of
the same sign, but rather more marked. Hence the difference
observed may be a real one, or perhaps the real difference may be
greater and may be partially masked by a fluctuation of sampling.
If 50 per cent. and 35 per cent. were the true proportions in the
two classes, the standard error of the percentage difference would
be, by equation (6),

$$\epsilon_{12} = \left(\frac{50 \times 50}{34} + \frac{35 \times 65}{34}\right)^{1/2} \approx 11.9 \text{ per cent.},$$

and consequently the actual difference might not infrequently be
completely masked by fluctuations of sampling, so long as experiments
were only conducted on the same small scale.

Example iv. (Data from J. Gray, Memoir on the Pigmentation
Survey of Scotland, Jour. of the Royal Anthropological Institute,
vol. xxxvii., 1907.) The following are extracted from the tables
relating to hair-colour of girls at Edinburgh and Glasgow:

               Of Medium       Total       Per cent.
              Hair-colour.   observed.     Medium.
Edinburgh        4,008         9,743        41.1
Glasgow         17,529        39,764        44.1

Can the difference observed in the percentage of girls of medium
hair-colour have arisen solely through fluctuations of sampling?

In the two towns together the percentage of girls with medium
hair-colour is 43.5 per cent. If this were the true percentage,
the standard error of sampling for the difference between
percentages observed in samples of the above sizes would be

$$\epsilon_{12} = \left\{43.5 \times 56.5\left(\frac{1}{9743} + \frac{1}{39764}\right)\right\}^{1/2} = 0.56 \text{ per cent.}$$

The actual difference is 3.0 per cent., or over 5 times this, and
could not have arisen through the chances of simple sampling.

If we assume that the difference is a real one and calculate the
standard error by equation (6), we arrive at the same value, viz.
0.56 per cent. With such large samples the difference could not,
accordingly, be obliterated by the fluctuations of simple sampling.
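Equations (5) and (6) of Case II can be sketched together, using the counts of Examples iii and iv. The function names are illustrative assumptions, and the code works with exact fractions rather than the rounded percentages of the text, so the last figure may differ slightly from the printed values.

```python
import math


def se_diff_pooled(p0: float, n1: int, n2: int) -> float:
    """Equation (5): both samples assumed to share the proportion p0."""
    return math.sqrt(p0 * (1.0 - p0) * (1.0 / n1 + 1.0 / n2))


def se_diff_separate(p1: float, n1: int, p2: float, n2: int) -> float:
    """Equation (6): p1 and p2 taken as the true, differing proportions."""
    return math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)


def compare(a1: int, n1: int, a2: int, n2: int):
    """Return (observed difference, standard error by (5), by (6))."""
    p1, p2 = a1 / n1, a2 / n2
    p0 = (a1 + a2) / (n1 + n2)          # weighted mean proportion
    return (abs(p1 - p2),
            se_diff_pooled(p0, n1, n2),
            se_diff_separate(p1, n1, p2, n2))


lobelia = compare(17, 34, 12, 34)            # Example iii
hair = compare(4008, 9743, 17529, 39764)     # Example iv

for name, (diff, e5, e6) in (("Lobelia", lobelia), ("hair-colour", hair)):
    print(f"{name}: difference {100 * diff:.2f}, "
          f"eq.(5) {100 * e5:.2f}, eq.(6) {100 * e6:.2f} per cent., "
          f"ratio {diff / e5:.2f}")
```

In both examples the two formulae give nearly the same standard error, which illustrates the remark that the distinction between (5) and (6) is usually of more theoretical than practical importance.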

Case III. Two samples are drawn from distinct material or
different universes, as in the last case, giving proportions of
A's $p_1$ and $p_2$, but in lieu of comparing the proportion $p_1$ with
$p_2$ it is compared with the proportion of A's in the two samples
together, viz. $p_0$, where, as before,

$$p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}.$$

Required to find whether the difference between $p_1$ and $p_0$ can
have arisen as a fluctuation of simple sampling, $p_0$ being the
true proportion of A's in both samples.

This case corresponds to the testing of an association which
is indicated by a comparison of the proportion of A's amongst
the B's with the proportion of A's in the universe. The general
treatment is similar to that of Case II., but the work is complicated
owing to the fact that errors in $p_1$ and $p_0$ are not independent.

If $\epsilon_{01}$ be the standard error of the difference between $p_1$ and
$p_0$, we have at once

$$\epsilon_{01}^2 = \epsilon_0^2 + \epsilon_1^2 - 2 r_{01}\epsilon_0\epsilon_1
= p_0 q_0\left(\frac{1}{n_1 + n_2} + \frac{1}{n_1}\right) - 2 r_{01}\epsilon_0\epsilon_1,$$

$r_{01}$ being the correlation between errors of simple sampling in
$p_1$ and $p_0$. But, from the above equation relating $p_0$ to $p_1$
and $p_2$, writing it in terms of deviations in $p_0$, $p_1$ and $p_2$,
multiplying by the deviation in $p_1$ and summing, we have,
since errors in $p_1$ and $p_2$ are uncorrelated,

$$r_{01}\epsilon_0\epsilon_1 = \frac{n_1}{n_1 + n_2}\,\epsilon_1^2 = \frac{p_0 q_0}{n_1 + n_2}.$$

Therefore finally

$$\epsilon_{01}^2 = p_0 q_0\,\frac{n_2}{n_1(n_1 + n_2)}.$$

Unless the difference between $p_1$ and $p_0$ exceed, say, some
three times this value of $\epsilon_{01}$, it may have arisen solely by the
chances of simple sampling.

It will be observed that if $n_1$ be very small compared with
$n_2$, $\epsilon_{01}$ approaches, as it should, the standard error for a sample
of $n_1$ observations.

We omit, in this case, the allied problem whether, if the
difference between $p_1$ and $p_0$ indicated by the samples were
real, it might be wiped out in other samples of the same size
by fluctuations of simple sampling alone. The solution is a
little complex as we no longer have $\epsilon_0^2 = p_0 q_0/(n_1 + n_2)$.

Example v. Taking the data of Example iii., suppose that
we compare the proportion of tall plants amongst the offspring
resulting from cross-fertilisations (viz. 50 per cent.) with the
proportion amongst all offspring (viz. 29/68, or 42.6 per cent.).
As, in this case, both the subsamples have the same number
of observations, $n_1 = n_2 = 34$, and

$$\epsilon_{01} = \left(\frac{29}{68} \times \frac{39}{68} \times \frac{1}{68}\right)^{1/2} = 0.06,$$

or 6 per cent. As in the working of Example iii., the observed
difference is only 1.25 times the standard error of the difference,
and consequently it may have arisen as a mere fluctuation
of sampling.

Example vi. Taking now the figures of Example iv., suppose
that we had compared the proportion of girls of medium hair-
colour in Edinburgh with the proportion in Glasgow and
Edinburgh together. The former is 41.1 per cent., the latter
43.5 per cent., difference 2.4 per cent. The standard error of
the difference between the percentages observed in the sub-
sample of 9743 observations and the entire sample of 49,507
observations is therefore

$$\epsilon_{01} = \left(43.5 \times 56.5 \times \frac{39764}{9743 \times 49507}\right)^{1/2} = 0.45 \text{ per cent.}$$

The actual difference is over five times this (the ratio must, of
course, be the same as in Example iv.), and could not have occurred
as a mere error of sampling.
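A minimal sketch of Case III follows, assuming the standard error of the difference between a sub-sample proportion and the proportion in the whole sample is sqrt(p0 q0 n2 / (n1 (n1 + n2))), as follows from the correlation argument of that case. The function and variable names are illustrative only.

```python
import math


def se_part_vs_whole(p0: float, n1: int, n2: int) -> float:
    """Standard error of p1 - p0, where p0 pools samples of n1 and n2."""
    return math.sqrt(p0 * (1.0 - p0) * n2 / (n1 * (n1 + n2)))


# Example v: 17 tall plants of 34 against 29 of 68 in both classes together.
p0_plants = 29 / 68
e_plants = se_part_vs_whole(p0_plants, 34, 34)
ratio_plants = (17 / 34 - p0_plants) / e_plants

# Example vi: Edinburgh (4008 of 9743) against both towns together.
n1, n2 = 9743, 39764
p1 = 4008 / n1
p0_hair = (4008 + 17529) / (n1 + n2)
e_hair = se_part_vs_whole(p0_hair, n1, n2)
ratio_hair = (p0_hair - p1) / e_hair

print(f"Example v:  {100 * e_plants:.1f} per cent., ratio {ratio_plants:.2f}")
print(f"Example vi: {100 * e_hair:.2f} per cent., ratio {ratio_hair:.2f}")
```

The ratio of difference to standard error is algebraically identical to the one obtained by comparing the two sub-samples directly with equation (5), which is the point made in the parenthesis of Example vi.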

REFERENCES.

The theory of sampling, for the cases dealt with in this chapter, is generally
treated by first determining the frequency-distribution of the number of
successes in a sample. This frequency-distribution is not considered till
Chapter XV., and the student will be unable to follow much of the literature
until he has read that chapter.

Experimental results of dice throwing, coin tossing, etc.

(1) Quetelet, A., Lettres . . . . sur la théorie des probabilités; Bruxelles,
1846 (English translation by O. G. Downes; C. & E. Layton, London,
1849). See especially letter liv. and the table on p. 374 of the
French, p. 25S of the English, edition.

(2) Westergaard, H., Die Grundzüge der Theorie der Statistik; Fischer,
Jena, 1890.

(3) Edgeworth, F. Y., Article on the "Law of Error" in the Tenth Edition
of the Encyclopædia Britannica, vol. xxviii., 1902, p. 280.

(4) Darbishire, A. D., "Some Tables for illustrating Statistical Correlation,"
Mem. and Proc. of the Manchester Lit. and Phil. Soc., vol. li.,
1907.

General: and applications to sex-ratio of births.

(5) Poisson, S. D., "Sur la proportion des naissances des filles et des
garçons," Mémoires de l'Acad. des Sciences, vol. ix., 1829, p. 239.
(Principally theoretical: the statistical illustrations very slight.)

(6) Lexis, W., Zur Theorie der Massenerscheinungen in der menschlichen
Gesellschaft; Freiburg, 1877.

(7) Lexis, W., Abhandlungen zur Theorie der Bevölkerungs- und Moralstatistik;
Fischer, Jena, 1903. (Contains, with new matter, reprints of
some of Professor Lexis' earlier papers in a form convenient for
reference.)

(8) Edgeworth, F. Y., "Methods of Statistics," Jour. Roy. Stat. Soc.,
jubilee volume, 1885, p. 181.

^KNN, John, TM Loaic of Chance 
(Gf. the data regarding the distribution of sexes in families 
to whicli reference woe made in S E>. ) 

(10) Pearson, Karl, "Skew Variation in Homogeneous Material," Phil.
Trans. Roy. Soc., Series A, vol. clxxxvi., 1895, p. 343. (Sections 2 to
6 on the binomial distribution.)

(11) Edgeworth, F. Y., "Miscellaneous Applications of the Calculus of
Probabilities," Jour. Roy. Stat. Soc., vols. lx., lxi., 1897-8 (especially
part ii., vol. lxi. p. 119).

(12) Vigor, H. D., and G. U. Yule, "On the Sex-ratios of Births in the
Registration Districts of England and Wales, 1881-90," Jour. Roy.
Stat. Soc., vol. lxix., 1906, p. 576. (Use of the harmonic mean as in
§ 11.)

As regards the sex-ratio, reference may also be made to papers in
vols. v. and vi. of Biometrika by Heron, Weldon, and Woods.

The law of small chances (§ 12).

(13) Poisson, S. D., Recherches sur la probabilité des jugements, etc.; Paris,
1837. (Pp. 205-7.)

(14) Bortkewitsch, L. von, Das Gesetz der kleinen Zahlen; Teubner,
Leipzig, 1898.

(15) Student, "On the Error of Counting with a Hæmacytometer,"
Biometrika, vol. v. p. 351, 1907.

(16) Rutherford, E., and H. Geiger, with a note by H. Bateman,
"The probability variations in the distribution of α particles," Phil.
Mag., Series 6, vol. xx., 1910, p. 698. (The frequency of particles
emitted during a small interval of time follows the law of small
chances.)



EXERCISES.

1. (Ref. 4: total of columns of all the 13 tables given.)
Compare the actual with the theoretical mean and standard-deviation for
the following record of 6500 throws of 12 dice, 4, 5, or 6 being reckoned
a success.

Successes. Frequency.

7 ISfil 



Total SEOO 

2. (Ref. 1.)

Balls were drawn from a bag containing equal numbers of black and white
balls, each ball being returned before drawing another. The records were then
grouped by counting the number of black balls in consecutive 2's, 3's, 4's, 5's,
etc. The following give the distributions so derived for grouping by 5's, 6's,
and 7's. Compare actual with theoretical means and standard-deviations.



Sncoeesea. 


WQreuping 
by Fives. 


(6) Grouping 
bjSiies. 


(e) Gronping 

by Serene. 



1 
2 


30 
125 
277 
224 
136 

27 


17 
6G 
IM 
192 
166 
69 


9 
34 

104 
161 
148 
96 
40 
4 


Tolal 


819 , 683 


58G 



3. (Ref. 2, p. 22.)

Ten thousand drawings of a ball from a bag containing equal numbers of
black and white were made in the same manner as in the preceding example,
and then grouped into 100 sets of 100. The following gives the resulting
frequency of different numbers of white balls. Compare mean and standard-
deviation with theory.



Number. Freqtieiicy. 



Number. Freqtiency. 



Number. Frequency, 






4. The proportion of successes in the data of Qu. 1 is 0.5097. Find the
standard-deviation of the proportion with the given number of throws, and state
whether you would regard the excess of successes as probably significant of bias
in the dice.

5. In the 4096 drawings on which Qu. 2 is based 2030 balls were black
and 2066 white. Is this divergence probably significant of bias?

6. If a frequency-distribution such as those of Questions 1, 2, and 3 be given,
show how n and p, if unknown, may be approximately determined from the
mean and standard-deviation of the distribution.

Find n and p in this way from the data of Qu. 1 and Qu. 3.







Actual 
















deviation a. 


ot Sampling fj„ 


1 


508-2' 


11-80 


11-18 . 


2 


609-6 


6-79 


6-45 


3 


6100 


5-38 


5-00 




511-1 


6-08 


4-22 


G 


610-2 


a-67 




6,7 


5087 


4 -IS 


S-24 


8, », 10, 11 


608-7 


3-10 


2-99 


12, 18, 14 


608-4 


2-55 


2-25 


16 and upwards. 


508-2 


2 13 


1-85 



8. In a case of mice-breeding (see reference given in § 11) the harmonic
mean number in a litter was 4.735, and the expected proportion of albinos
50 per cent. Find the standard-deviation of simple sampling for the
proportion of albinos in a litter, and state whether the actual standard-deviation
(21.68 per cent.) probably indicates any real variation, or not.

9. (Data from Report i., Evolution Committee of the Royal Society, p. 17.)
In breeding certain stocks 408 hairy and 126 glabrous plants were obtained.
If the expectation is one-fourth glabrous, is the divergence significant, or might
it have occurred as a fluctuation of sampling?

10. (Data of Example viii. and Qu. 6, Chap. III.) Is the association in
either of the following cases likely to have arisen as a fluctuation of simple
sampling?

(o)UB) = 47 (A$)^12 {«B)=21 («^} = 8 

(i) (AB)^SOi f,AB) = 21i («S)=1S2 (aS) = 119 

11. The sex-ratio at birth is sometimes given by the ratio of male to female
births, instead of the proportion of male to total births. If Z is the ratio, i.e.
Z = p/q, show that the standard error of Z is approximately
$(1 + Z)\sqrt{Z/n}$,
n being large, so that deviations are small compared with the mean. (The
student may find it useful to refer to § 8, Chap. XI.)









CHAPTER XIV.

SIMPLE SAMPLING CONTINUED: EFFECT OF
REMOVING THE LIMITATIONS OF SIMPLE SAMPLING.

1. Warning as to the assumption that three times the standard error gives the
range for the majority of fluctuations of simple sampling of either sign.
2. Warning as to the use of the observed for the true value of p in
the formula for the standard error. 3. The inverse standard error, or
standard error of the true proportion for a given observed proportion:
equivalence of the direct and inverse standard errors when n is large.
4-8. The importance of errors other than fluctuations of "simple
sampling" in practice: unrepresentative or biassed samples. 9-10.
Effect of divergences from the conditions of simple sampling: (a)
effect of variation in p and q for the several universes from which the
samples are drawn. 11-12. (b) Effect of variation in p and q from one
sub-class to another within each universe. 13-14. (c) Effect of a
correlation between the results of the several events. 15. Summary.

1. There are two warnings as regards the methods adopted in
the examples in the concluding section of the last chapter
which the student should note, as they may become of importance
when the number of observations is small. In the first place, he
should remember that, while we have taken three times the
standard error as giving the limits within which the great
majority of errors of sampling of either sign are contained,
the limits are not, as a rule, strictly the same for positive and
for negative errors. As is evident from the examples of actual
distributions in § 7, Chap. XIII., the distribution of errors is not
strictly symmetrical unless p = q = 0.5. No theoretical rule as
to the limits can be given, but it appears from the examples
referred to and from the calculated distributions in Chap. XV.
§ 3, that a range of three times the standard error includes
the great majority of the deviations in the direction of the
longer "tail" of the distribution, while the same range on the
shorter side may extend beyond the limits of the distribution
altogether. If, therefore, p be less than 0.5, our assumed range
may be greater than is possible for negative errors, or if p be
greater than 0.5, greater than is possible for positive errors. The
assumption is not, however, likely as a rule to lead to a serious
mistake; as stated at the commencement of this paragraph, the
point is of importance only when n is small, for when n is large the
distribution tends to become sensibly symmetrical even for values
of p differing considerably from 0.5. (Cf. Chap. XV. for the
properties of the limiting form of distribution.)
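The point about the shorter tail can be illustrated with a small sketch. The values n = 20, n = 2000 and p = 0.1 are hypothetical, chosen only for illustration, and the function name is an assumption.

```python
import math


def three_sigma_range(n: int, p: float):
    """Mean number of successes minus and plus three standard errors."""
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    return mean - 3.0 * sd, mean + 3.0 * sd


# Hypothetical small sample: the lower limit falls below zero successes,
# i.e. beyond the limits of the distribution on its shorter side.
low_small, high_small = three_sigma_range(20, 0.1)

# Hypothetical large sample with the same p: both limits are attainable.
low_large, high_large = three_sigma_range(2000, 0.1)

print(f"n = 20:   {low_small:.2f} to {high_small:.2f} successes")
print(f"n = 2000: {low_large:.2f} to {high_large:.2f} successes")
```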

2. In the second place, the student should note that, where we
were unable to assign any a priori value to p, we have assumed
that it is sufficiently accurate to replace p in the formula for the
standard error by the proportion actually observed, say π.
Where n is large so that the standard error of p becomes small
relatively to the product pq, the assumption is justifiable, and no
serious error is possible. If, however, n be small, the use of the
observed value π may lead to an under- or over-estimation of the
standard error which cannot be neglected. To get some rough
idea of the possible importance of such effects, the approximate
standard error ε may first be calculated as usual from the
observed proportion π, and then fresh values recalculated, replacing
π by π ± 3ε. It should be remembered that the maximum
value of the product pq is given by p = q = 0.5, and hence these
values, if within the limits of fluctuations of sampling, will give
one limiting value for the standard error. The procedure is by
no means exact, but may serve to give a useful warning.

Thus in Example iii. of Chap. XIII. the observed proportion of
tall plants is 29/68, or, say, 43 per cent. The standard error of
this proportion is 6 per cent., and a true proportion of 50 per
cent. is therefore well within the limits of fluctuations of sampling.
The maximum value of the standard error is therefore

$$\left(\frac{50 \times 50}{68}\right)^{1/2} = 6.06 \text{ per cent.}$$

On the other hand, the standard error is unlikely to be lower
than that based on a proportion of 43 - 18 = 25 per cent.,

$$\left(\frac{25 \times 75}{68}\right)^{1/2} = 5.25 \text{ per cent.}$$
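The rough procedure of recalculating the standard error at π ± 3ε can be sketched as below. The function `se_range` and its clipping of the interval to the range 0 to 1 are illustrative assumptions, not part of the text; because the code uses the exact fraction 29/68 rather than rounded percentages, its lower limit comes out slightly below the 5.25 per cent. of the worked example.

```python
import math


def se(p: float, n: int) -> float:
    """Standard error of a proportion, sqrt(pq/n)."""
    return math.sqrt(p * (1.0 - p) / n)


def se_range(pi: float, n: int):
    """Smallest and largest standard error for p within pi +/- 3 eps."""
    eps = se(pi, n)
    lo_p = max(0.0, pi - 3.0 * eps)
    hi_p = min(1.0, pi + 3.0 * eps)
    # pq is greatest at the point of the interval nearest to 0.5 ...
    nearest = min(max(0.5, lo_p), hi_p)
    # ... and least at the end of the interval farthest from 0.5.
    farthest = lo_p if abs(lo_p - 0.5) > abs(hi_p - 0.5) else hi_p
    return se(farthest, n), se(nearest, n)


lowest, highest = se_range(29 / 68, 68)
print(f"{100 * lowest:.2f} to {100 * highest:.2f} per cent.")
```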



3. The two difficulties mentioned in §§ 1 and 2 arise when n, 
the number of cases in the sample, is small. The interpretation 
of the value of the standard error is also more limited in this 
case than when n is large. Suppose a large number of observa- 
tions to be made, by means of samples of m oteervations each, on 
different masses of material, or in different universes, for each of 
which the true value of p is known. On these data we could 

X8 



274 THEORY OF STATISTICS. 

form a correlatioD-table between the true proportiwi p in a given 
univeree and the observed proportion sr in a aample of n obaerva- 
tions drawn therefrom. What we have found from the work of 
the last chapter is that the standard-deviation of an array of $\pi$'s
associated with a certain true value p, in this table, is $(pq/n)^{1/2}$;
but the question may be asked: What is the standard-deviation
of the array at right angles to this, i.e. the array of p's associated
with a certain observed proportion $\pi$? In other words, given an
observed proportion $\pi$, what is the standard-deviation of the true
proportions? This is the inverse of the problem with which we
have been dealing, and it is a much more difficult problem.
On general principles, however, we can see that if n be large,
the two standard-deviations will tend, on the average of all
values of p, to be nearly the same, while if n be small the
standard-deviation of the array of $\pi$'s will tend to be appreciably the
greater of the two. For if $\pi = p + \delta$, $\delta$ is uncorrelated with p,
and therefore if $\sigma_p$ be the standard-deviation of p in all the
universes from which samples are drawn, $\sigma_\pi$ the standard-deviation
of observed proportions in the samples, and $\sigma_\delta$ the
standard-deviation of the difference,

$$\sigma_\pi^2 = \sigma_p^2 + \sigma_\delta^2.$$

But $\sigma_\delta^2$ varies inversely as n. Hence if n become very large, $\sigma_\delta$
becomes very small, $\sigma_\pi$ becomes sensibly equal to $\sigma_p$, and therefore
the standard-deviations of the arrays, on an average, are also
sensibly equal. If n be large, therefore, $[\pi(1 - \pi)/n]^{1/2}$ may be
taken as giving, with sufficient exactness, the standard-deviation
of the true proportion p for a given observed proportion $\pi$. But
if n be small, $\sigma_\delta$ cannot be neglected in comparison with $\sigma_p$, $\sigma_\pi$ is
therefore appreciably greater than $\sigma_p$, and the standard-deviation
of the array of $\pi$'s is, on an average of all arrays, correspondingly
greater than the standard-deviation of the array of p's (the statement
is not true for every pair of corresponding arrays, especially
for extreme values of p near 0 and 1). Further, it should be
noticed that, while the regression of $\pi$ on p is unity, i.e. the
mean of the array of $\pi$'s is identical with p, the type of the
array, the regression of p on $\pi$ is less than unity. If we assume,
therefore, that a tabulation of all possible chances, observed
for every conceivable subject, would give a distribution of p
ranging uniformly between 0 and 1, or indeed grouped symmetrically
in any way round 0.5, any observed value $\pi$ greater than
0.5 will probably correspond to a true value of p slightly lower
than $\pi$, and conversely. We have already referred to the use of
the inverse standard error in § 13 of Chap. XIII. (Case II., p. 365).
If we determine, for example, the standard error of the difference 



REMOVING THE LIMITATIONS OF SIMPLE SAMPLING. 275

between two observed proportions by equation (6) of that chapter,
this may be taken, provided n be large, as approximately the
standard-deviation of true differences for the given observed
difference.
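
The relation between the two standard-deviations, and the statement that the regression of p on $\pi$ falls short of unity, may be checked numerically. The following is a minimal sketch, assuming (as in the illustration of the text) that p is uniformly distributed between 0 and 1; the function name `uniform_prior_summary` is our own and is introduced for illustration only. It uses the known result that, under this assumption, every number of successes from 0 to n is equally likely.

```python
from fractions import Fraction

def uniform_prior_summary(n):
    """Exact variances when p is uniform on (0, 1) and pi = k/n, k successes in n events."""
    var_p = Fraction(1, 12)            # variance of the true proportion p
    var_delta = Fraction(1, 6) / n     # mean value of pq/n, since the mean of p(1-p) is 1/6
    # Under the uniform assumption each count k = 0..n is equally likely,
    # so the variance of pi = k/n can be summed directly.
    values = [Fraction(k, n) for k in range(n + 1)]
    mean_pi = sum(values) / (n + 1)
    var_pi = sum((v - mean_pi) ** 2 for v in values) / (n + 1)
    # delta is uncorrelated with p, so the product-moment of p and pi is var_p.
    slope_p_on_pi = var_p / var_pi
    return var_p, var_delta, var_pi, slope_p_on_pi

for n in (10, 1000):
    var_p, var_delta, var_pi, slope = uniform_prior_summary(n)
    print(n, float(var_p + var_delta), float(var_pi), float(slope))
```

For small n the regression of p on $\pi$ is sensibly below unity, while for large n it approaches unity and the two variances become nearly equal, as stated above.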

4. The use of standard errors must be exercised with care. It
is very necessary to remember the limited assumptions on which
the theory of simple sampling is based, and to bear in mind that
it covers those fluctuations alone which exist when all the assumed
conditions are fulfilled. The formulae obtained for the standard
errors of proportions and of their differences have no bearing
except on the one question, whether an observed divergence of a
certain proportion from a certain other proportion that might be
observed in a more extended series of observations, or that has
actually been observed in some other series, might or might not
be due to fluctuations of simple sampling alone. Their use is
thus quite restricted, for in many cases of practical sampling this
is not the principal question at issue. The principal question in
many such cases concerns quite a different point, viz. whether the
observed proportion $\pi$ in the sample may not diverge from the
proportion p existing in the universe from which it was drawn,
owing to the nature of the conditions under which the sample was
taken, $\pi$ tending to be definitely greater or definitely less than
p. Such divergence between $\pi$ and p might arise in two distinct
ways: (1) owing to variations of classification in sorting the
A's and a's, the characters not being well defined, a source of
error which we need not further discuss, but one which may lead
to serious results [cf. ref. 5 of Chap. V.]; (2) owing to either A's
or a's tending to escape the attentions of the sampler. To give
an illustration from artificial chance, if on drawing samples from
a bag containing a very large number of black and white balls
the observed proportion of black balls was $\pi$, we could not
necessarily infer that the proportion of black balls in the bag was
approximately $\pi$, even though the standard error were small, and
we knew that the proportions in successive samples were subject
to the law of simple sampling. For the black balls might be,
say, much more highly polished than the white ones, so as to
tend to escape the fingers of the sampler, or they might be
represented by a number of lively black insects sheltering amongst
white stones; in neither case would the ratio of black balls to
white, or of insects to stones, be represented in their proper
proportions. Clearly, in any parallel case, inferences as to the
material from which the sample is drawn are of a very doubtful
and uncertain kind, and it is this uncertainty whether the chance
of inclusion in the sample is the same for A's and a's, far more
than the mere divergences between different samples drawn in



276 THEORY OF STATISTICS.

the same way, which renders many statistical results based on 
samples so dubious. 
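
The illustration of the polished balls may be put in figures. The following is a minimal sketch, assuming a hypothetical "relative catchability" for the black balls (a quantity of our own devising, not one used in the text): if each black ball is only half as likely to be seized as each white ball, the observed proportion settles down to a value quite different from the true one, although the standard error of simple sampling is small.

```python
def expected_observed_proportion(p, relative_catchability):
    """Long-run proportion of black balls drawn when each black ball is
    `relative_catchability` times as likely to be seized as each white ball."""
    w = relative_catchability
    return w * p / (w * p + (1.0 - p))

def standard_error(pi, n):
    """Standard error of simple sampling for an observed proportion pi in n drawings."""
    return (pi * (1.0 - pi) / n) ** 0.5

p_true = 0.5                                               # true proportion of black balls
pi_long_run = expected_observed_proportion(p_true, 0.5)    # black balls half as catchable
se = standard_error(pi_long_run, 10_000)
print(pi_long_run, se, (p_true - pi_long_run) / se)
```

The divergence between p and $\pi$ is here many times the standard error, which measures only the fluctuations of simple sampling and gives no warning of the bias.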

5. Thus in collecting returns as to family income and expenditure
from working-class households, the families with lower
incomes are almost certain to be under-represented; they largely
"escape the sampler's fingers" from their simple lack of ability
to keep the necessary accounts. It is almost impossible to say,
however, to what extent they are under-represented, or to form
any estimate as to the possible error when two such samples
taken by different persons at different times, or in different places,
are compared. Again, if estimates as to crop-production were
formed on the basis of a limited number of voluntary returns,
the estimates would be likely to err in excess, as the persons who
made the returns would probably include an undue proportion
of the more intelligent farmers whose crops would tend to be
above average. Whilst voluntary returns are in this way liable
to lead to more or less unrepresentative samples, compulsory
sampling does not evade the difficulty. Compulsion could not
ensure equally accurate and trustworthy returns from illiterate
and well-educated workmen, from intelligent and unintelligent
farmers. The following of some definite rule in drawing the
sample may also produce unrepresentative samples: if samples
of fruit were taken solely from the top layers of baskets exposed
for sale, the results might be unduly favourable; if from the
bottom layer, unduly unfavourable.

6. In such cases we can see that any sample, taken in the
way supposed, is likely to be definitely biassed, in the sense
that it will not tend to include, even in the long run, equal
proportions of the A's and a's in the original material. In other
cases there may be no obvious reason for presuming such bias,
but, on the other hand, no certainty that it does not exist. Thus
if we noted the hair-colours of the children in, say, one
school in ten in a large town, the question would arise whether
this method would tend to give an unbiassed sample of all the
children. No assured answer could be given: conjectures on
the matter would be based in part on the way in which the
schools were selected, e.g. the volunteering of teachers for the work
might in itself introduce an element of bias. Again, if say
10,000 herrings were measured as landed at various North Sea
ports, and the question were raised whether the sample was
likely to be an unbiassed sample of North Sea herrings, no
assured answer could be given. There may be no definite reason
for expecting definite bias in either case, but it may exist, and
no mere examination of the sample itself can give any information
as to whether it exists or no.




7. Such an examination may be of service, however, as
indicating one possible source of bias, viz. great heterogeneity in
the original material. If, for example, in the first illustration,
the hair-colours of the children differed largely in the different
schools, much more largely than would be accounted for by
fluctuations of simple sampling, it would be obvious that one
school would tend to give an unrepresentative sample, and
questionable therefore whether the five, ten or fifteen schools
observed might not also have given an unrepresentative sample.
Similarly, if the herrings in different catches varied largely, it
would, again, be difficult to get a representative sample for a
large area. But while the dissimilarity of subsamples would
then be evidence as to the difficulty of obtaining a representative
sample, the similarity of subsamples would, of course, be no
evidence that the sample was representative, for some very
different material which should have been represented might
have been missed or overlooked.

8. The student must therefore be very careful to remember
that even if some observed difference exceed the limits of fluctuation
in simple sampling, it does not follow that it exceeds the
limits of fluctuation due to what the practical man would regard,
and quite rightly regard, as the chances of sampling. Further,
he must remember that if the standard error be small, it by no
means follows that the result is necessarily trustworthy: the
smallness of the standard error only indicates that it is not
untrustworthy owing to the magnitude of fluctuations of simple
sampling. It may be quite untrustworthy for other reasons:
owing to bias in taking the sample, for instance, or owing to definite
errors in classifying the A's and a's. On the other hand, of course,
it should also be borne in mind that an observed proportion is not
necessarily incorrect, but merely to a greater or less extent
untrustworthy if the standard error be large. Similarly, if an
observed proportion $\pi_1$ in a sample drawn from one universe be
greater than an observed proportion $\pi_2$ in a sample drawn from
another universe, but $\pi_1 - \pi_2$ is considerably less than three times
the standard error of the difference, it does not, of course, follow
that the true proportions for the given universes, $p_1$ and $p_2$, are
most probably equal. On the contrary, $p_1$ most likely exceeds $p_2$;
the standard error only warns us that this conclusion is more or
less uncertain, and that possibly $p_2$ may even exceed $p_1$.
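
The comparison of an observed difference with its standard error may be sketched as follows. The equation (6) of Chap. XIII. is not reproduced in this section, so the sketch assumes one common form for the standard error of a difference, $(\pi_1(1-\pi_1)/n_1 + \pi_2(1-\pi_2)/n_2)^{1/2}$, in which each sample supplies its own term; the figures and the function name are hypothetical.

```python
import math

def difference_in_standard_errors(pi1, n1, pi2, n2):
    """Return the difference pi1 - pi2 in units of its standard error, and that standard error.
    Assumed form: each sample contributes pi(1 - pi)/n to the squared standard error."""
    se = math.sqrt(pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2)
    return (pi1 - pi2) / se, se

# Hypothetical samples of 200 each, with observed proportions 0.55 and 0.50.
ratio, se = difference_in_standard_errors(0.55, 200, 0.50, 200)
print(round(se, 4), round(ratio, 2))
```

The difference is here only about once its standard error, so that it might well have arisen from fluctuations of simple sampling; it remains the case, as remarked above, that the first true proportion more probably exceeds the second than the reverse.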

9. Let us now consider the effect, on the standard-deviation of
sampling, of divergences from the conditions of simple sampling
which were laid down in § 8 of Chap. XIII.

First suppose the condition (a) to break down, so that there is
some essential difference between the localities from which, or the




conditions under which, samples are drawn, or that some essential
change has taken place during the period of sampling. We may
represent such circumstances in a case of artificial chance by
supposing that for the first $f_1$ throws of n dice the chance of
success for each die is $p_1$, for the next $f_2$ throws $p_2$, for the next $f_3$
throws $p_3$, and so on, the chance of success varying from time to
time, just as the chance of death, even for individuals of the same
age and sex, varies from district to district. Suppose, now, that
the records of all these throws are pooled together. The mean
number of successes per throw of the n dice is given by

$$M = \frac{n}{N}\left(f_1 p_1 + f_2 p_2 + f_3 p_3 + \dots\right) = n p_0,$$

where $N = \Sigma(f)$ is the whole number of throws and $p_0$ is the mean
value $\Sigma(fp)/N$ of the varying chance p. To find the
standard-deviation of the number of successes at each throw consider that
the first set of throws contributes to the sum of the squares of
deviations an amount

$$f_1\left[n p_1 q_1 + n^2 (p_1 - p_0)^2\right],$$

$n p_1 q_1$ being the square of the standard-deviation for these throws,
and $n(p_1 - p_0)$ the difference between the mean number of
successes for the first set and the mean for all the sets together.
Hence the standard-deviation $\sigma$ of the whole distribution is given
by the sum of all quantities like the above, or

$$N\sigma^2 = n\,\Sigma(fpq) + n^2\,\Sigma f(p - p_0)^2.$$

Let $\sigma_p$ be the standard-deviation of p, then the last sum is
$N n^2 \sigma_p^2$, and substituting 1 - p for q, we have

$$\sigma^2 = n p_0 - n p_0^2 - n\sigma_p^2 + n^2\sigma_p^2 = n p_0 q_0 + n(n-1)\sigma_p^2 \qquad (1)$$

This is the formula corresponding to equation (1) of Chap.
XIII.: if we deal with the standard-deviation of the proportion
of successes, instead of that of the absolute number, we have,
dividing through by $n^2$, the formula corresponding to equation
(2) of Chap. XIII., viz.

$$s^2 = \frac{p_0 q_0}{n} + \frac{n-1}{n}\,\sigma_p^2 \qquad (2)$$
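
Equation (1) may be verified by direct calculation. The sketch below, in which the chances and numbers of throws are hypothetical figures chosen only for illustration, forms the pooled distribution of the number of successes exactly and compares its variance with $n p_0 q_0 + n(n-1)\sigma_p^2$.

```python
from math import comb

def pooled_variance(n, chances, throws):
    """Exact variance of the number of successes per throw of n dice when the
    chance of success is chances[i] for throws[i] of the throws, all records pooled."""
    N = sum(throws)
    dist = [
        sum(f * comb(n, k) * p ** k * (1 - p) ** (n - k) for p, f in zip(chances, throws)) / N
        for k in range(n + 1)
    ]
    mean = sum(k * w for k, w in enumerate(dist))
    return sum((k - mean) ** 2 * w for k, w in enumerate(dist))

def formula_1(n, chances, throws):
    """Right-hand side of equation (1): n*p0*q0 + n*(n-1)*sigma_p^2."""
    N = sum(throws)
    p0 = sum(f * p for p, f in zip(chances, throws)) / N
    var_p = sum(f * (p - p0) ** 2 for p, f in zip(chances, throws)) / N
    return n * p0 * (1 - p0) + n * (n - 1) * var_p

# Hypothetical case: 12 dice, three sets of throws with different chances.
exact = pooled_variance(12, [0.2, 0.5, 0.6], [100, 300, 50])
by_formula = formula_1(12, [0.2, 0.5, 0.6], [100, 300, 50])
print(exact, by_formula)
```

The two values agree to within the error of the arithmetic; dividing either by $n^2$ gives the corresponding check of equation (2).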

10. If n be large and $s_0$ be the standard-deviation calculated
from the mean proportion of successes $p_0$, equation (2) is sensibly
of the form

$$s^2 = s_0^2 + \sigma_p^2,$$






Table showing Frequencies of Registration Districts in England and Wales
with Different Proportions of Deaths in Childbirth (including Deaths
from Puerperal Fever) per 1000 Births in the same Year, for the same
Groups of Districts as in the Table of Chap. XIII. § 10. Data from same
source. Decade 1881-90.

                                      Number of Births in the Decade.
Deaths per                  1500    3500    4500   10,000  15,000  30,000  50,000
1000 Births.                 to      to      to      to      to      to      to
                            2500.   4000.   5000.  15,000. 20,000. 50,000. 90,000.

[The body of the table, giving the frequencies of districts in classes of
half a unit from 1.5-2.0 to 10.5-11.0 deaths per 1000 births, is not
recoverable from this copy. An entry marked .. below is illegible.]

Total                        36      53      40      73      33      48      35
Mean                        5.29    4.71    4.45     ..     4.99    5.13    4.64
Standard-deviation          1.77    1.37    1.09    1.01    0.99    1.12    0.87
Theoretical standard-deviation
  corresponding to
  mean births               1.62    1.12    0.97    0.61    0.53    0.36     ..
$\sqrt{s^2 - s_0^2}$        0.71    0.80    0.61    0.80    0.84    1.07    0.83

and hence, knowing s and $s_0$, we can find $\sigma_p$, the standard-deviation
of the chance or proportion in the universes from which the
samples have been drawn.
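
The calculation is a simple one, since for large n the significant standard-deviation is $\sigma_p = \sqrt{s^2 - s_0^2}$. The following sketch applies it to three pairs of values (observed and theoretical standard-deviation) as read from the foot of the table of deaths in childbirth; the function name `significant_sd` is our own.

```python
import math

def significant_sd(s, s0):
    """Standard-deviation of the chance itself, from the observed standard-deviation s
    and the standard-deviation of simple sampling s0 (valid when n is large)."""
    return math.sqrt(s * s - s0 * s0)

# (observed, theoretical) pairs as read from three columns of the table.
pairs = [(1.77, 1.62), (1.01, 0.61), (0.99, 0.53)]
for s, s0 in pairs:
    print(s, s0, round(significant_sd(s, s0), 2))
```

The results agree with the corresponding entries in the last row of the table.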

The values of $\sqrt{s^2 - s_0^2}$ are tabulated at the foot of the table
showing the distribution of the proportion of male births in




certain registration districts of England, in § 10 of Chap. XIII.
p. 259. It will be seen that in the first group of small districts
there appears to be a significant standard-deviation of some 6
units in the proportion of male births per thousand, but in the
more urban districts this falls to 1 or 2 units; in one case only
does s fall short of $s_0$. In the table on p. 279 are given some
different data relating to the deaths of women in childbirth in the
same groups of districts, and in this case the effect of definite
causes is relatively larger, as one might expect. The values of
$\sqrt{s^2 - s_0^2}$ suggest an almost uniform significant standard-deviation
$\sigma_p = 0.8$ in the deaths of women per thousand births, five out of
the eight values being very close to this average. The figures of
this case also bring out clearly one important consequence of (2),
viz. that if we make n large s becomes sensibly equal to $\sigma_p$, while
if we make n small s becomes more nearly equal to $\sqrt{p_0 q_0/n}$. Hence
if we want to know the significant standard-deviation of the
proportion p, the measure of its fluctuation owing to definite causes,
n should be made as large as possible; if, on the other hand, we
want to obtain good illustrations of the theory of simple sampling
n should be made small. If n be very large the actual
standard-deviation may evidently become almost indefinitely large
compared with the standard-deviation of sampling. Thus during the
20 years 1855-74 the death-rate in England and Wales fluctuated
round a mean value of 22.2 per thousand with a standard-deviation
of 0.86. Taking the mean population as roughly 21 millions,
the standard-deviation of sampling is approximately

$$\sqrt{\frac{22 \times 978}{21 \times 10^6}} = 0.032.$$

This is only about one twenty-seventh of the actual value.
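
The arithmetic of this illustration may be set out as follows; the sketch takes the figures of the text (a rate of 22 per thousand, a population of 21 millions, and an actual standard-deviation of 0.86) and works throughout in units per thousand.

```python
import math

rate = 22.0            # death-rate per thousand, as rounded in the text
population = 21e6      # mean population
actual_sd = 0.86       # observed standard-deviation of the rate, per thousand

# Standard-deviation of simple sampling of a rate expressed per thousand.
sd_sampling = math.sqrt(rate * (1000 - rate) / population)
ratio = actual_sd / sd_sampling
print(round(sd_sampling, 3), round(ratio, 1))
```

The ratio comes out at about 27, in agreement with the statement that the standard-deviation of sampling is only about one twenty-seventh of the actual value.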

11. Now consider the effect of altering the second condition
of simple sampling, given in § 8 (b) of Chapter XIII., viz. the
condition that the chances p and q shall be the same for every
die or coin in the set, or the circumstances that regulate the
appearance of the character observed the same for every individual
or every subclass in each of the universes from which samples
are drawn. Suppose that in the group of n dice thrown the
chances for $m_1$ dice are $p_1$, $q_1$; for $m_2$ dice, $p_2$, $q_2$, and so on,
the chances varying for different dice, but being constant
throughout the experiment. The case differs from the last, as
in that the chances were the same for every die, at any one
throw, but varied from one throw to another: now they are
constant from throw to throw, but differ from one die to another as
they would in any ordinary set of badly made dice. Required to
find the effect of these differing chances.



For the mean number of successes we evidently have

$$M = m_1 p_1 + m_2 p_2 + m_3 p_3 + \dots = n p_0,$$

$p_0$ being the mean chance $\Sigma(mp)/n$. To find the standard-deviation
of the number of successes at each throw, it should be noted that
this may be regarded as made up of the number of successes in
the $m_1$ dice for which the chances are $p_1$, $q_1$, together with the
number of successes amongst the $m_2$ dice for which the chances
are $p_2$, $q_2$, and so on: and these numbers of successes are all
independent. Hence

$$\sigma^2 = m_1 p_1 q_1 + m_2 p_2 q_2 + m_3 p_3 q_3 + \dots = \Sigma(mpq).$$

Substituting 1 - p for q, as before, and using $\sigma_p$ to denote the
standard-deviation of p,

$$\sigma^2 = n p_0 q_0 - n\sigma_p^2 \qquad (3)$$

or if s be, as before, the standard-deviation of the proportion of
successes,

$$s^2 = \frac{p_0 q_0}{n} - \frac{\sigma_p^2}{n} \qquad (4)$$
12. The effect of the chances varying for the individual dice or
other "events" is therefore to lower the standard-deviation, as
calculated from the mean proportion $p_0$, and the effect may
conceivably be considerable. To take a limiting case, if p be zero
for half the events and unity for the remainder, $p_0 = q_0 = \frac{1}{2}$, and
$\sigma_p = \frac{1}{2}$, so that s is zero. To take another illustration, still
somewhat extreme, if the values of p are uniformly distributed over
the whole range between 0 and 1, $p_0 = q_0 = \frac{1}{2}$ as before but $\sigma_p^2 =
1/12 = 0.0833$ (Chap. VIII. § 12, p. 143). Hence $s^2 = 0.1667/n$,
$s = 0.408/\sqrt{n}$, instead of $0.5/\sqrt{n}$, the value of s if the chances are
$\frac{1}{2}$ in every case. In most practical cases, however, the effect will be
much less. Thus the standard-deviation of sampling for a
death-rate of, say, 18 per thousand in a population of uniform age and
one sex is $(18 \times 982)^{1/2}/\sqrt{n} = 133/\sqrt{n}$. In a population of the age
composition of that of England and Wales, however, the
death-rate is not, of course, uniform, but varies from a high value in
infancy (say 150 per thousand), through very low values (2 to 4
per thousand) in childhood to continuously increasing values in
old age; the standard-deviation of the rate within such a
population is roughly about 30 per thousand. But the effect of this






variation on the standard-deviation of simple sampling is quite
small, for, as calculated from equation (4),

$$s = \frac{(18 \times 982 - 30^2)^{1/2}}{\sqrt{n}} = \frac{129.5}{\sqrt{n}},$$

as compared with $133/\sqrt{n}$.
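
The three illustrations of this section may be worked through together. The sketch below evaluates $s\sqrt{n} = (p_0 q_0 - \sigma_p^2)^{1/2}$ from equation (4); the function name `scaled_sd` and its `unit` argument (1 for proportions, 1000 for rates per thousand) are conveniences of our own.

```python
import math

def scaled_sd(p0, sigma_p, unit=1.0):
    """Value of s * sqrt(n) from equation (4): sqrt(p0*q0 - sigma_p^2),
    with q0 = unit - p0 so that rates per thousand may be used directly."""
    return math.sqrt(p0 * (unit - p0) - sigma_p ** 2)

print(scaled_sd(0.5, 0.5))                    # p zero for half the events, unity for the rest
print(scaled_sd(0.5, math.sqrt(1 / 12)))      # p uniformly distributed between 0 and 1
print(scaled_sd(18, 0, unit=1000))            # uniform death-rate of 18 per thousand
print(scaled_sd(18, 30, unit=1000))           # rate varying with age, sigma_p = 30 per thousand
```

The values obtained are 0, 0.408, 133 and 129.5 (to the figures retained in the text), so that even the considerable variation of the death-rate with age lowers the standard-deviation of sampling by less than 3 per cent.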

13. We have finally to pass to the third condition (c) of § 8, Chap.
XIII., and to discuss the effect of a certain amount of dependence
between the several "events" in each sample. We shall suppose,
however, that the two other conditions (a) and (b) are fulfilled,
the chances p and q being the same for every event at every trial,
and constant throughout the experiment. The problem is again
most simply treated on the lines of § 5 of the last chapter. The
standard-deviation for each event is $(pq)^{1/2}$ as before, but the events
are no longer independent; instead, therefore, of the simple
expression

$$\sigma^2 = n\,pq,$$

we must have (cf. Chap. XI. § 2)

$$\sigma^2 = n\,pq + 2pq\,(r_{12} + r_{13} + \dots + r_{23} + \dots),$$

where $r_{12}$, $r_{13}$, etc. are the correlations between the results of the
first and second, first and third events, and so on: correlations
for variables (number of successes) which can only take the
values 0 and 1, but may nevertheless, of course, be treated as
ordinary variables (cf. Chap. XI. § 10). There are n(n - 1)/2
correlation-coefficients, and if, therefore, r is the arithmetic mean
of the correlations we may write

$$\sigma^2 = n\,pq\,[1 + r(n-1)] \qquad (5)$$

The standard-deviation of simple sampling will therefore be
increased or diminished according as the average correlation
between the results of the single events is positive or negative,
and the effect may be considerable, as $\sigma$ may be reduced to zero
or increased to $n(pq)^{1/2}$. For the standard-deviation of the
proportion of successes in each sample we have the equation

$$s^2 = \frac{pq}{n}\,[1 + r(n-1)] \qquad (6)$$

It should be noted that, as the means and standard-deviations
for our variables are all identical, r is the correlation-coefficient
for a table formed by taking all possible pairs of results in the
n events of each sample.




It should also be noted that the case when r is positive covers
the departure from the rules of simple sampling discussed in
§§ 9-10: for if we draw successive samples from different records,
this introduces the positive correlation at once, even although the
results of the events at each trial are quite independent of one
another. Similarly, the case discussed in §§ 11-12 is covered by
the case when r is negative; for if the chances are not the same
for every event at each trial, and the chance of success for some
one event is above the average, the mean chance of success for the
remainder must be below it. The cases (a), (b) and (c) are,
however, best kept distinct, since a positive or negative correlation
may arise for reasons quite different from those discussed in
§§ 9-12.
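
The limits between which equation (5) allows the standard-deviation to range may be seen from a short calculation. The sketch below simply evaluates $\sigma^2 = n\,pq\,[1 + r(n-1)]$ for a hypothetical case of 10 events with p = 0.3; the function name is our own.

```python
def variance_with_correlation(n, p, r):
    """Equation (5): variance of the number of successes in n events of chance p
    when the mean correlation between the results of pairs of events is r."""
    return n * p * (1 - p) * (1 + r * (n - 1))

n, p = 10, 0.3
print(variance_with_correlation(n, p, 0.0))              # independent events: n*p*q
print(variance_with_correlation(n, p, 1.0))              # events all succeed or fail together: n^2*p*q
print(variance_with_correlation(n, p, -1.0 / (n - 1)))   # lowest admissible mean correlation: zero
```

With r = 1 the variance is n times that of simple sampling, and with r = -1/(n - 1) it vanishes; the latter value is that met with in sampling from a limited universe when the whole universe is drawn.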

14. As a simple illustration, consider the important case of
sampling from a limited universe, e.g. of drawing n balls in
succession from the whole number w in a bag containing pw white
balls and qw black balls. On repeating such drawings a large
number of times, we are evidently equally likely to get a white
ball or a black ball for the first, second, or nth ball of the sample:
the correlation-table formed from all possible pairs of every sample
will therefore tend in the long-run to give just the same form of
distribution as the correlation-table formed from all possible pairs
of the w balls in the bag. But from Chap. XI. § 11 we
know that the correlation-coefficient for this table is -1/(w - 1),
whence

$$\sigma^2 = n\,pq\left[1 - \frac{n-1}{w-1}\right] = n\,pq\,\frac{w-n}{w-1}.$$

If n = 1, we have the obviously correct result that $\sigma = (pq)^{1/2}$, as
in drawing from unlimited material: if, on the other hand, n = w,
$\sigma$ becomes zero as it should, and the formula is thus checked for
simple cases. For drawing 2 balls out of 4, $\sigma$ becomes 0.816
$(npq)^{1/2}$; for drawing 5 balls out of 10, 0.745 $(npq)^{1/2}$; in the case
of drawing half the balls out of a very large number, it
approximates to $(0.5\,npq)^{1/2}$, or 0.707 $(npq)^{1/2}$.
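
The formula for the limited universe may be checked against the exact distribution of the number of white balls in the sample. The sketch below enumerates that distribution with `math.comb` for a hypothetical bag of 10 balls of which 4 are white, and also reproduces the two numerical factors quoted in the text.

```python
from math import comb, sqrt

def exact_variance(w, white, n):
    """Exact variance of the number of white balls when n balls are drawn,
    without replacement, from w balls of which `white` are white."""
    total = comb(w, n)
    dist = [comb(white, k) * comb(w - white, n - k) / total for k in range(n + 1)]
    mean = sum(k * x for k, x in enumerate(dist))
    return sum((k - mean) ** 2 * x for k, x in enumerate(dist))

def limited_universe_variance(w, white, n):
    """The formula of the text: n*p*q*(w - n)/(w - 1)."""
    p = white / w
    return n * p * (1 - p) * (w - n) / (w - 1)

print(exact_variance(10, 4, 5), limited_universe_variance(10, 4, 5))
print(round(sqrt((4 - 2) / (4 - 1)), 3))     # 2 balls out of 4
print(round(sqrt((10 - 5) / (10 - 1)), 3))   # 5 balls out of 10
```

The exact and the calculated variances agree, and the two factors come out as 0.816 and 0.745.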

In the case of contagious or infectious diseases, or of certain
forms of accident that are apt, if fatal at all, to result in wholesale
deaths, r is positive, and if n be large (as it usually is in such
cases) a very small value of r may easily lead to a very great increase
in the observed standard-deviation. It is difficult to give a really
good example from actual statistics, as the conditions are hardly
ever constant from one year to another, but the following will




serve to illustrate the point. During the twenty years 1887-1906
there were 2107 deaths from explosions of firedamp or coal-dust
in the coal-mines of the United Kingdom, or an average of 105
deaths per annum. From § 12 of Chap. XIII. it follows that this
should be the square of the standard-deviation of simple sampling,
or the standard-deviation itself approximately 10.3. But the
square of the actual standard-deviation is 7178, or its value 84.7,
the numbers of deaths ranging between 14 (in 1903) and 317
(in 1894). This large standard-deviation, to judge from the
figures, is partly, though not wholly, due to a general tendency to
decrease in the numbers of deaths from explosions in spite of a
large increase in the number of persons employed; but even if we
ignore this, the magnitude of the standard-deviation can be
accounted for by a very small value of the correlation r, expressive
of the fact that if an explosion is sufficiently serious to be fatal to
one individual, it will probably be fatal to others also. For if $\sigma_0$
denote the standard-deviation of simple sampling, $\sigma$ the
standard-deviation of sampling given by equation (5), we have

$$r = \frac{\sigma^2 - \sigma_0^2}{(n-1)\,\sigma_0^2}.$$

Whence, from the above data, taking the numbers of persons
employed underground at a rough average of 560,000,

$$r = \frac{7073}{559{,}999 \times 105} = +0.00012.$$
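
The estimate of r may be reproduced from the figures given, using equation (5) solved for r. The sketch below takes the squared standard-deviation of simple sampling as 105, as in the text.

```python
var_simple = 105          # square of the standard-deviation of simple sampling (mean deaths per annum)
var_actual = 7178         # square of the actual standard-deviation
n = 560_000               # rough average number of persons employed underground

# Equation (5) gives var_actual = var_simple * (1 + r * (n - 1)); solve for r.
r = (var_actual - var_simple) / ((n - 1) * var_simple)
print(round(r, 5))
```

A mean correlation of little more than one part in ten thousand thus suffices to raise the standard-deviation from about 10 to about 85.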

15. Summarising the preceding paragraphs, §§ 9-14, we see
that if the chances p and q differ for the various universes,
districts, years, materials, or whatever they may be from which
the samples are drawn, the standard-deviation observed will be
greater than the standard-deviation of simple sampling, as
calculated from the average values of the chances: if the average
chances are the same for each universe from which a sample is
drawn, but vary from individual to individual or from one
sub-class to another within the universe, the standard-deviation
observed will be less than the standard-deviation of simple
sampling as calculated from the mean values of the chances:
finally, if p and q are constant, but the events are no longer
independent, the observed standard-deviation will be greater or
less than the simplest theoretical value according as the
correlation between the results of the single events is positive or
negative. These conclusions further emphasise the need for
caution in the use of standard errors. If we find that the




standard-deviation in some case of sampling exceeds the
standard-deviation of simple sampling, two interpretations are possible:
either that p and q are different in the various universes from
which samples have been drawn (i.e. that the variations are
more or less definitely significant in the sense of § 13, Chap. XIII.),
or that the results of the events are positively correlated inter
se. If the actual standard-deviation fall short of the
standard-deviation of simple sampling, two interpretations are again
possible, either that the chances p and q vary for different
individuals or sub-classes in each universe, while approximately
constant from one universe to another, or that the results of
the events are negatively correlated inter se. Even if the
actual standard-deviation approaches closely to the
standard-deviation of simple sampling, it is only a conjectural and not
a necessary inference that all the conditions of "simple sampling"
as defined in § 8 of the last chapter are fulfilled. Possibly, for
example, there may be a positive correlation r between the
results of the different events, masked by a variation of the
chances p and q in sub-classes of each universe.

Sampling which fulfils the conditions laid down in § 8 of
Chap. XIII., simple sampling as we have called it, is generally
spoken of as random sampling. We have thought it better to
avoid this term, as the condition that the sampling shall be
random (haphazard) is not the only condition tacitly assumed.



REFERENCES. 

Cf. generally the references to Chap. XIII., to which may be
added:

(1) Pearson, Karl, "On certain Properties of the Hypergeometrical Series,
and on the fitting of such Series to Observation Polygons in the Theory of
Chance," Philosophical Magazine, 5th Series, vol. xlvii., 1899, p. 236.
(An expansion of one section of ref. 10 of Chap. XIII., dealing with the
first problem of our § 14, i.e. drawing samples from a bag containing
a limited number of white and black balls, from the standpoint of the
frequency-distribution of the number of white or black balls in the
samples.)

EXERCISES. 

1. Referring to Question 7 of Chap. XIII., work out the values of the
significant standard-deviation $\sigma_p$ (as in § 10) for each row or group of rows
there given, but taking row 5 with rows 6 and 7.

2. For all the districts in England and Wales included in the same table
(Table VI., Chap. IX.) the standard-deviation of the proportion of male births
per 1000 of all births is 7.46 and the mean proportion of male births 509.2.
The harmonic mean number of births in a district is 6070. Find the significant
standard-deviation $\sigma_p$.




3. If for one half of n events the chance of success is p and the chance of
failure q, whilst for the other half the chance of success is q and the chance of
failure p, what is the standard-deviation of the number of successes, the events
being all independent?

4. The following are the deaths from small-pox during the 20 years
1882-1901 in England and Wales:

1882   1317      1892    431

  83    967        93   1457



  91     49      1901    356

The death-rate from small-pox being very small, the rule of § 12, Chap.
XIII., may be applied to estimate the standard-deviation of simple sampling.
Assuming that the excess of the actual standard-deviation over this can be
entirely accounted for by a correlation between the results of exposure to risk
of the individuals composing the population, estimate r. The mean population
during the period may be taken in round numbers as 29 millions.






CHAPTER XV. 

THE BINOMIAL DISTRIBUTION AND THE
NORMAL CURVE.

1-2. Determination of the frequency-distribution for the number of successes
in n events: the binomial distribution. 3. Dependence of the form
of the distribution on p, q and n. 4-5. Graphical and mechanical
methods of forming representations of the binomial distribution.
6. Direct calculation of the mean and the standard-deviation from
the distribution. 7-8. Necessity of deducing, for use in many
practical cases, a continuous curve giving approximately, for large
values of n, the terms of the binomial series. 9. Deduction of the
normal curve as a limit to the symmetrical binomial. 10-11. The
value of the central ordinate. 12. Comparison with a binomial
distribution for a moderate value of n. 13. Outline of the more general
conditions from which the curve can be deduced by advanced methods.
14. Fitting the curve to an actual series of observations. 15. Difficulty
of a complete test of fit by elementary methods. 16. The table of areas
of the normal curve and its use. 17. The quartile deviation and the
"probable error." 18. Illustrations of the application of the normal
curve and of the table of areas.



more important cases, and the applications of the results indicated.
For the simpler cases of artificial chance it is possible, however, to
go much further, and determine not merely the standard-deviation
but the entire frequency-distribution of the number of "successes."
This we propose to do for the case of "simple sampling," in which
all the events are completely independent, and the chances p and
q the same for each event and constant throughout the trials.
The case corresponds to the tossing of ideally perfect coins
(homogeneous circular discs), or the throwing of ideally perfect dice
(homogeneous cubes).

2. If we deal with one event only, we expect in N trials, Nq
failures and Np successes. Suppose we now combine with the
results of this first event the results of a second. The two events
are quite independent, and therefore, according to the rule of
independence, of the Nq failures of the first event (Nq)q will be
associated (on an average) with failures of the second event, and
(Nq)p with successes of the second event (cf. row 2 of the scheme
on p. 288). Similarly of the Np successful first events, (Np)q will
be associated (on an average) with failures of the second event
and (Np)p with successes. In trials of two events we would
therefore expect approximately $Nq^2$ cases of no success, $2Npq$
cases of one success and one failure, and $Np^2$ cases of two successes,
as in row 3 of the scheme. The results of a third event may be
combined with those of the first two in precisely the same way.
Of the $Nq^2$ cases in which both the first two events failed, $(Nq^2)q$
will be associated (on an average) with failure of the third also,
$(Nq^2)p$ with success of the third. Of the $2Npq$ cases of one
success and one failure, $(2Npq)q$ will be associated with failure
of the third event and $(2Npq)p$ with success, and similarly for
the $Np^2$ cases in which both the first two events succeeded. The
result is that in N trials of three events we should expect $Nq^3$
cases of no success, $3Npq^2$ cases of one success, $3Np^2q$ cases of two
successes, and $Np^3$ cases of three successes, as in row 5 of the
scheme. The scheme is continued for the results of a fourth
event, and it is evident that all the results are included under a
very simple rule: the frequencies of 0, 1, 2 . . . . successes are
given

for one event    by the binomial expansion of   $N(q+p)$
for two events        "          "              $N(q+p)^2$
for three events      "          "              $N(q+p)^3$
for four events       "          "              $N(q+p)^4$

and so on. Quite generally, in fact: the frequencies of 0, 1, 2 . . . .
successes in N trials of n events are given by the successive terms
in the binomial expansion of $N(q+p)^n$, viz.

$$N\left\{q^n + n\,q^{n-1}p + \frac{n(n-1)}{1\cdot 2}\,q^{n-2}p^2 + \frac{n(n-1)(n-2)}{1\cdot 2\cdot 3}\,q^{n-3}p^3 + \ldots\right\}$$

This is the first theoretical expression that we have obtained for 
the form of a frequency-distribution. 
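The rule just stated can be checked numerically. The following is a minimal sketch in Python, not part of the original text; the function name `binomial_frequencies` is an illustrative choice. It returns the expected frequencies of 0, 1, 2, ..., n successes as the successive terms of $N(q+p)^n$.

```python
from math import comb

def binomial_frequencies(N, n, p):
    """Successive terms of N(q+p)^n: expected frequencies of 0..n successes."""
    q = 1 - p
    return [N * comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]

# Three events with p = q = 1/2 and N = 8 trials:
# the terms are N q^3, 3N p q^2, 3N p^2 q, N p^3.
print(binomial_frequencies(8, 3, 0.5))
```

The terms always add up to N, since (q+p) is unity.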

3. The general form of the distributions given by such 
binomial series will have been evident from the experimental 
examples given in Chapter XIII., i.e. they are distributions
of greater or less asymmetry, tailing off in either direction
from the mode. The distribution is, however, of so much
importance that it is worth while considering the form in 
greater detail. This form evidently depends (1) on the values 
of q and p, (2) on the value of the exponent n. If p and q 
are equal, evidently the distribution must be symmetrical, for
p and q may be interchanged without altering the value of
any term, and consequently terms equidistant from either
end of the series are equal. If p and q are unequal, on the
other hand, the distribution is asymmetrical, and the more
asymmetrical, for the same value of n, the greater the inequality
of the chances. The following table shows the calculated
distributions for n = 20 and values of p, proceeding by 0.1,
from 0.1 to 0.5. When p = 0.1, cases of two successes are the



Number of     p = 0.1   p = 0.2   p = 0.3   p = 0.4   p = 0.5
Successes.    q = 0.9   q = 0.8   q = 0.7   q = 0.6   q = 0.5

     0          1216       115         8        -         -
     1          2702       576        68         5        -
     2          2852      1369       278        31         2
     3          1901      2054       716       123        11
     4           898      2182      1304       350        46
     5           319      1746      1789       746       148
     6            89      1091      1916      1244       370
     7            20       545      1643      1659       739
     8             4       222      1144      1797      1201
     9             1        74       654      1597      1602
    10            -         20       308      1171      1762
    11            -          5       120       710      1602
    12            -          1        39       355      1201
    13            -         -         10       146       739
    14            -         -          2        49       370
    15            -         -         -         13       148
    16            -         -         -          3        46
    17            -         -         -         -         11
    18            -         -         -         -          2
    19            -         -         -         -         -
    20            -         -         -         -         -





most frequent, but cases of one success [are nearly as frequent];
even nine successes may, however, [occur about once in 10,000]
trials. As p is increased, the position of the [maximum]
frequency gradually advances, and the two tails of the distribution
become more nearly equal, until p = 0.5, when the distribution
is symmetrical. Of course, if the table were continued, the
distribution for p = 0.6 would be similar to that for q = 0.6,
but reversed end for end, and so on. Since the standard-deviation
is $\sqrt{npq}$ and the maximum value of pq is given by
p = q, the symmetrical distribution has the greatest dispersion. 




If p = q the effect of increasing n is to raise the mean and
increase the dispersion. If p is not equal to q, however, not
only does an increase in n raise the mean and increase the
dispersion, but it also lessens the asymmetry; the greater
n, for the same value of p and q, the less the asymmetry.
Thus if we compare the first distribution of the above table
with that given by n = 100, we have the following:



B. Terms of the Binomial Series $10{,}000\,(0.9 + 0.1)^{100}$. (Figures given [to the nearest unit].)

Number of               Number of               Number of
Successes.  Frequency.  Successes.  Frequency.  Successes.  Frequency.

    0           -           8         1148         16          193
    1            3          9         1304         17          106
    2           16         10         1319         18           54
    3           59         11         1199         19           26
    4          159         12          988         20           12
    5          339         13          743         21            5
    6          596         14          513         22            2
    7          889         15          327         23            1



The maximum frequencies now occur for 9 and 10 successes,
and the two "tails" are much more nearly equal. If, on the
other hand, n is reduced to 2, the distribution is

Number of Successes.    Frequency.
        0                  8100
        1                  1800
        2                   100

and the maximum frequency is at one end of the range. Whatever
the values of p and q, if n is only increased sufficiently, the
distribution may be treated as sensibly symmetrical, the necessary
condition being (we state this without proof) that p - q shall be
small compared with the standard-deviation $\sqrt{npq}$. It is left
to the student to calculate as an exercise the theoretical distribu- 
tions corresponding to the experimental results cited in Chapter 
XIII. (Question 1). 
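The condition stated without proof in § 3 can be illustrated numerically. The sketch below is an illustration only, not from the original; the name `asymmetry_ratio` is hypothetical. It computes the magnitude of p - q relative to $\sqrt{npq}$ for p = 0.1 and the three exponents n = 2, 20 and 100 discussed above.

```python
from math import sqrt

def asymmetry_ratio(n, p):
    """|p - q| / sqrt(npq); the smaller the ratio, the more nearly
    symmetrical the binomial distribution."""
    q = 1 - p
    return abs(p - q) / sqrt(n * p * q)

for n in (2, 20, 100):
    print(n, round(asymmetry_ratio(n, 0.1), 3))
```

The ratio falls steadily as n is increased, in agreement with the tables: the distribution for n = 2 has its maximum at one end of the range, while that for n = 100 has tails much more nearly equal.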

4. The property of the binomial series used in the scheme of
§ 2 for deducing the series with exponent n from that with
exponent n-1 leads to two interesting methods, graphical and
mechanical, for constructing approximate representations of
binomial distributions. It will have been noted that any one
term, say the rth, in one series is obtained by taking q times the
rth term together with p times the (r-1)th term of the preceding
series. Now if AP, CR (figure 46) be two verticals, and a third,
BQ, be erected between them, cutting PR in Q, so that
AB : BC :: q : p, then

BQ = p.AP + q.CR.

(This follows at once on joining AR and considering the two
segments into which BQ is divided.) Consider then some
binomial, say for the case p = 1/4, q = 3/4. Draw a series of verticals
(the heavy verticals of fig. 47) at any convenient distance apart
on a horizontal base line, and erect other verticals (the lighter
verticals) dividing the distance between them in the ratio of
q : p, i.e. 3 : 1. Next, choosing a vertical scale, draw the binomial
polygon for the simplest case n = 1; in the diagram N has been
taken = 4096, and the polygon is abcd, 0b = 3072, 1c = 1024. The
polygons for higher values of n may now be constructed graphically.
Mark the points where ab, bc, cd respectively cut the
intermediate verticals and project them horizontally to the right
on to the thick verticals. This gives the polygon ab'c'd'e for
n = 2. For 0b' = q.0b, 1c' = p.0b + q.1c, and so on. Similarly, if the
points where ab', b'c', etc., cut the intermediate verticals are
projected horizontally on to the thick verticals, we have the
polygon ab''c''d''e''f for n = 3. The process may be continued
indefinitely, though it will be found difficult to maintain any
high degree of accuracy after the first few constructions.
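The arithmetic behind the graphical construction can be sketched as follows. This is an illustration only, not from the original; the function name `next_series` is hypothetical, and exact fractions stand in for the drawing. Each step forms the new rth term as q times the old rth term plus p times the old (r-1)th term, starting from the case n = 1 of the text (N = 4096, p = 1/4).

```python
from fractions import Fraction

def next_series(terms, p):
    """One step of the construction: new rth term = q * old rth + p * old (r-1)th."""
    q = 1 - p
    padded = [0] + list(terms) + [0]
    return [q * padded[r + 1] + p * padded[r] for r in range(len(terms) + 1)]

p = Fraction(1, 4)
series = [Fraction(3072), Fraction(1024)]   # n = 1 with N = 4096
for n in (2, 3):
    series = next_series(series, p)
    print(n, [int(t) for t in series])
```

Unlike the drawing, the arithmetic loses no accuracy however often the step is repeated.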








































5. The mechanical method of constructing the representation of
a binomial series is indicated diagrammatically by fig. 48. The
apparatus consists of a funnel opening into a space, say a ¼ inch in
depth, between a sheet of glass and a back-board. This space is
broken up by successive rows of wedges like 1, 2 3, 4 5 6, etc., which
will divide up into streams any granular material such as shot or
mustard seed which is poured through the funnel when the
apparatus is held at a slope. At the foot these wedges are
replaced by vertical strips, in the spaces between which the




Fig. 48. The Pearson-Galton Binomial Apparatus.



material can collect. Consider the stream of material that
comes from the funnel and meets the wedge 1. This wedge is
set so as to throw q parts of the stream to the left and p parts
to the right (of the observer). The wedges 2 and 3 are set so as
to divide the resultant streams in the same proportions. Thus
wedge 2 throws $q^2$ parts of the original material to the left and
qp to the right, wedge 3 throws pq parts of the original material
to the left and $p^2$ to the right. The streams passing these wedges
are therefore in the ratio of $q^2 : 2qp : p^2$. The next row of wedges
is again set so as to divide these streams in the same proportions
as before, and the four streams that result will bear the proportions
$q^3 : 3q^2p : 3qp^2 : p^3$. The final set, at the heads of the
vertical strips, will give the streams proportions
$q^4 : 4q^3p : 6q^2p^2 : 4qp^3 : p^4$, and these streams will accumulate
between the strips and give a representation of the binomial by a
kind of histogram, as shown. Of course as many rows of wedges may
be provided as may be desired.

This kind of apparatus was originally devised by Sir Francis
Galton (ref. 1) in a form that gives roughly the symmetrical
binomial, a stream of shot being allowed to fall through rows of
nails, and the resultant streams being collected in partitioned
spaces. The apparatus was generalised by Professor Pearson,
who used rows of wedges fixed to movable slides, so that they
could be adjusted to give any ratio of q : p. (Ref. 11.)
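The action of the apparatus can be imitated by a small simulation. The sketch below is an illustration under stated assumptions, not a description of the actual instrument: each particle is supposed to be thrown right with chance p and left with chance q at every row, independently, and the function name `pearson_galton` and its parameters are hypothetical.

```python
import random

def pearson_galton(particles, rows, p, seed=0):
    """Drop particles through `rows` rows of wedges; each wedge sends a
    particle to the right with chance p and to the left with chance 1 - p.
    Returns the counts collected in the rows + 1 compartments."""
    rng = random.Random(seed)
    bins = [0] * (rows + 1)
    for _ in range(particles):
        rights = sum(rng.random() < p for _ in range(rows))
        bins[rights] += 1
    return bins

# Four rows of wedges set to q : p = 3 : 1.
print(pearson_galton(10000, 4, 0.25))
```

With many particles the counts approach the proportions $q^4 : 4q^3p : 6q^2p^2 : 4qp^3 : p^4$, subject to fluctuations of sampling.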

6. The values of the mean and standard-deviation of a binomial 
distribution may be found from the terms of the series directly, 
as well as by the method of Chap. XIII. (the calculation was 
in fact given as an exercise in Question 8, Chap. VII., and
Question 6, Chap. VIII.). Arrange the terms under each other
as in col. 1 below, and treat the problem as if it were an
arithmetical example, taking the arbitrary origin at 0 successes: as
N is a factor all through, it may be omitted for convenience.



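(1)                                             (2)       (3)                                      (4)
Frequency f.                                    Dev. ξ.   fξ.                                      fξ².

$q^n$                                            0        0                                        0
$n\,q^{n-1}p$                                    1        $n\,q^{n-1}p$                            $n\,q^{n-1}p$
$\frac{n(n-1)}{1\cdot 2}\,q^{n-2}p^2$            2        $n(n-1)\,q^{n-2}p^2$                     $2n(n-1)\,q^{n-2}p^2$
$\frac{n(n-1)(n-2)}{1\cdot 2\cdot 3}\,q^{n-3}p^3$  3      $\frac{n(n-1)(n-2)}{1\cdot 2}\,q^{n-3}p^3$  $\frac{3n(n-1)(n-2)}{1\cdot 2}\,q^{n-3}p^3$
. . . .                                          .        . . . .                                  . . . .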



The sum of col. 1 is of course unity, i.e. we are treating N as
unity, and the mean is therefore given by the sum of the terms
in col. (3). But this sum is

$$np\left\{q^{n-1} + (n-1)\,q^{n-2}p + \frac{(n-1)(n-2)}{1\cdot 2}\,q^{n-3}p^2 + \ldots\right\} = np\,(q+p)^{n-1} = np.$$

That is, the mean M is np, as by the method of Chap. XIII.

The square of the standard-deviation is given by the sum of
the terms in col. (4) less the square of the mean, that is,

$$\sigma^2 = np\left\{q^{n-1} + 2(n-1)\,q^{n-2}p + 3\,\frac{(n-1)(n-2)}{1\cdot 2}\,q^{n-3}p^2 + \ldots\right\} - n^2p^2.$$

But the series in the bracket is the binomial series $(q+p)^{n-1}$
with the successive terms multiplied by 1, 2, 3, . . . It therefore
gives the difference of the mean of the said binomial from -1,
and its sum is therefore (n-1)p + 1. Therefore

$$\sigma^2 = np\,\{(n-1)p + 1\} - n^2p^2 = np - np^2 = npq.$$
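The calculation of § 6 can be repeated arithmetically for any particular case. The following is a minimal sketch (the function name `mean_and_variance` is illustrative, not from the original) which forms the terms of $(q+p)^n$ and sums columns (3) and (4) directly, treating N as unity.

```python
from math import comb

def mean_and_variance(n, p):
    """Mean and squared standard-deviation found from the terms of (q+p)^n."""
    q = 1 - p
    f = [comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]   # col. (1)
    mean = sum(r * fr for r, fr in enumerate(f))                     # sum of col. (3)
    second = sum(r * r * fr for r, fr in enumerate(f))               # sum of col. (4)
    return mean, second - mean ** 2

# For n = 20, p = 0.1 the results should be np = 2 and npq = 1.8.
print(mean_and_variance(20, 0.1))
```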

7. The terms of the binomial series thus afford a means of
completely describing a certain class of frequency-distributions,
i.e. of giving not merely the mean and standard-deviation in
each case, but of describing the whole form of the distribution.
If N samples of n cards each be drawn from an indefinitely large
record of cards marked with A or a, the proportion of A-cards
in the record being p, then the successive terms of the series
$N(q+p)^n$ give the frequencies to be expected in the long run of
0, 1, 2, . . . A-cards in the sample, the actual frequencies only
deviating from these by errors which are themselves fluctuations
of sampling. The three constants N, p, n, therefore, determine
the average or smoothed form of the distribution to which actual
distributions will more or less closely approximate.

Considered, however, as a formula which may be generally
useful for describing frequency-distributions, the binomial series
suffers from a serious limitation, viz. that it only applies to a
strictly discontinuous distribution like that of the number of
A-cards drawn from a record containing A's and a's, or the number
of heads thrown in tossing a coin. The question arises whether
we can pass from this discontinuous formula to an equation
suitable for representing a continuous distribution of frequency.

8. Such an equation becomes, indeed, almost a necessity for
certain cases with which we have already dealt. Consider, for
example, the frequency-distribution of the number of male births
in batches of 10,000 births, the mean number being, say, 5100.
The distribution will be given by the terms of the series
$(0.49 + 0.51)^{10,000}$ and the standard-deviation is, in round numbers,
50 births. The distribution will therefore extend to some 150
births or more on either side of the mean number, and in order
to obtain it we should have to calculate some 300 terms of a
binomial series with an exponent of 10,000! This would not
only be practically impossible without the use of certain methods
of approximation, but it would give the distribution in quite
unnecessary detail; as a matter of practice, we would not have
compiled a frequency-distribution by single male births, but
would certainly have grouped our observations, taking probably
10 births as the class-interval. We want, therefore, to replace the
binomial series by some continuous curve, having approximately
the same ordinates, the curve being such that the area between
any two ordinates $y_1$ and $y_2$ will give the frequency of observations
between the corresponding values of the variable $x_1$ and $x_2$.
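What was practically impossible by hand is easy with modern arithmetic. The sketch below is an illustration only, not part of the original; it uses the log-gamma function to avoid forming the enormous factorials, and the names `log_term`, `lo`, `hi` and `covered` are hypothetical. It finds the standard-deviation for the male-birth example and sums the terms lying within 150 births of the mean.

```python
from math import exp, lgamma, log, sqrt

def log_term(n, m, p):
    """Natural logarithm of the binomial term for m successes in n events."""
    return (lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)
            + m * log(p) + (n - m) * log(1 - p))

n, p = 10_000, 0.51
sd = sqrt(n * p * (1 - p))                     # standard-deviation, sqrt(npq)
lo, hi = 5100 - 150, 5100 + 150                # some three standard-deviations each way
covered = sum(exp(log_term(n, m, p)) for m in range(lo, hi + 1))
print(round(sd, 2), hi - lo + 1, round(covered, 4))
```

The roughly 300 terms so summed account for very nearly the whole of the frequency.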

9. It is possible to find such a continuous limit to the binomial
series for any values of p and q, but in the present work we will
confine ourselves to the simplest case in which p = q = 0.5, and the
binomial is symmetrical. The terms of the series are

$$N\left(\tfrac{1}{2}\right)^n\left\{1 + n + \frac{n(n-1)}{1\cdot 2} + \frac{n(n-1)(n-2)}{1\cdot 2\cdot 3} + \ldots\right\}.$$

The frequency of m successes is

$$N\left(\tfrac{1}{2}\right)^n \frac{n!}{m!\,(n-m)!},$$

and the frequency of m + 1 successes is derived from this by
multiplying it by (n-m)/(m+1). The latter frequency is
therefore greater than the former so long as

$$n - m > m + 1, \quad\text{or}\quad m < \frac{n-1}{2}.$$

Suppose, for simplicity, that n is even, say equal to 2k; then the
frequency of k successes is the greatest, and its value is

$$y_0 = N\left(\tfrac{1}{2}\right)^{2k} \frac{(2k)!}{k!\,k!} \tag{1}$$

The polygon tails off symmetrically on either side of this greatest
ordinate. Consider the frequency of k + x successes; the value is

$$y_x = N\left(\tfrac{1}{2}\right)^{2k} \frac{(2k)!}{(k+x)!\,(k-x)!} \tag{2}$$

and therefore

$$\frac{y_x}{y_0} = \frac{k(k-1)(k-2)\ldots(k-x+1)}{(k+1)(k+2)(k+3)\ldots(k+x)}
= \frac{\left(1-\frac{1}{k}\right)\left(1-\frac{2}{k}\right)\left(1-\frac{3}{k}\right)\ldots\left(1-\frac{x-1}{k}\right)}{\left(1+\frac{1}{k}\right)\left(1+\frac{2}{k}\right)\left(1+\frac{3}{k}\right)\ldots\left(1+\frac{x}{k}\right)} \tag{3}$$

Now let us approximate by assuming, as suggested in § 8, that
k is very large, and indeed large compared with x, so that $(x/k)^2$
may be neglected compared with (x/k). This assumption does
not involve any difficulty, for we need not consider values of x
much greater than three times the standard-deviation or $3\sqrt{k/2}$,
and the ratio of this to k is $3/\sqrt{2k}$, which is necessarily small if k
be large. On this assumption we may apply the logarithmic
series

$$\log_e(1+\delta) = \delta - \frac{\delta^2}{2} + \frac{\delta^3}{3} - \frac{\delta^4}{4} + \ldots$$

to every bracket in the fraction (3), and neglect all terms beyond
the first. To this degree of approximation,

$$\log_e\frac{y_x}{y_0} = -\frac{2}{k}\left(1 + 2 + 3 + \ldots + (x-1)\right) - \frac{x}{k}
= -\frac{x(x-1)}{k} - \frac{x}{k} = -\frac{x^2}{k}.$$

Therefore, finally,

$$y_x = y_0\,e^{-\frac{x^2}{k}} = y_0\,e^{-\frac{x^2}{2\sigma^2}} \tag{4}$$

where, in the last expression, the constant k has been replaced by
the standard-deviation $\sigma$, for $\sigma^2 = k/2$.

The curve represented by this equation is symmetrical about
the point x = 0, which gives the greatest ordinate $y = y_0$. Mean,
median, and mode therefore coincide, and the curve is, in fact,
that drawn in fig. 5 and taken as the ideal form of the symmetrical
frequency-distribution in Chap. VI. The curve is generally
known as the normal curve of errors or of frequency, or the law
of error.

10. A normal curve is evidently defined completely by giving
the values of $y_0$ and $\sigma$ and assigning the origin of x. If we
desire to make a normal curve fit some given distribution as near
as may be, the last two data are given by the standard-deviation
and the mean respectively; the value of $y_0$ will be given by the
fact that the areas of the two distributions, or the numbers of
observations which these areas represent, must be the same.

This condition does not, however, lead in any simple and
elementary algebraic way to an expression for $y_0$, though such
a value could be found arithmetically to any desired degree
of approximation. For it is evident that (1) any alteration in
$y_0$ produces a proportionate alteration in the area of the curve,
e.g. doubling $y_0$ doubles every ordinate $y_x$ and therefore doubles
the area: (2) any alteration in $\sigma$ produces a proportionate
alteration in the area, for the values of $y_x$ are the same for the
same values of $x/\sigma$, and therefore doubling $\sigma$ doubles the distance
of every ordinate from the mean, and consequently doubles the
area. The area of the curve, or the number of observations
represented, is therefore proportional to $y_0\sigma$, or we must have

$$N = a \times y_0\,\sigma,$$

where a is a numerical constant. The value of a may be found
approximately by taking $y_0$ and $\sigma$ both equal to unity, calculating
the values of the ordinates $y_x$ for equidistant values of x, and
taking the area, or number of observations N, as given by the
sum of the ordinates multiplied by the interval.
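The approximate evaluation of a just described may be sketched as follows. This is an illustration only; the function name `area_constant`, the interval of 0.2 and the range of x from -5 to +5 are assumptions chosen for the example. The ordinates of $y = e^{-x^2/2}$ are summed at equidistant values of x and the sum is multiplied by the interval.

```python
from math import exp, pi, sqrt

def area_constant(step=0.2, limit=5.0):
    """Sum of the ordinates of y = exp(-x^2/2) at equidistant x from
    -limit to +limit, multiplied by the interval `step`."""
    steps = round(limit / step)
    total = sum(exp(-((i * step) ** 2) / 2) for i in range(-steps, steps + 1))
    return total * step

# Compare the approximation with the square root of 2*pi.
print(area_constant(), sqrt(2 * pi))
```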

11. The table below gives the values of y for values of x
proceeding by fifths of a unit; the values are, of course, the same
for positive and negative values of x. For the whole curve the
sum of the ordinates will be found to be 12.53318, the interval
being 0.2 units; the area is therefore, approximately, 2.50664,



Ordinates of the Curve $y = e^{-x^2/2}$.
[For references to more extended tables, see list on pp. 353-4.]

  x.       y.       Log y.        x.       y.       Log y.

  0      1.00000    0.00000      2.6     .03405    2̄.53208
  0.2     .98020    1̄.99131      2.8     .01984    2̄.29757
  0.4     .92312    1̄.96526      3.0     .01111    2̄.04567
  0.6     .83527    1̄.92183      3.2     .00598    3̄.77641
  0.8     .72615    1̄.86103      3.4     .00309    3̄.48978
  1.0     .60653    1̄.78285      3.6     .00153    3̄.18577
  1.2     .48675    1̄.68731      3.8     .00073    4̄.86439
  1.4     .37531    1̄.57439      4.0     .00034    4̄.52564
  1.6     .27804    1̄.44410      4.2     .00015    4̄.16952
  1.8     .19790    1̄.29644      4.4     .00006    5̄.79603
  2.0     .13534    1̄.13141      4.6     .00003    5̄.40516
  2.2     .08892    2̄.94901      4.8     .00001    6̄.99693
  2.4     .05614    2̄.74923      5.0     .00000    6̄.57132



and this is the approximate value of a. The value is more than
sufficiently accurate for practical purposes, for the exact value
is $\sqrt{2\pi} = 2.506628\ldots$ The proof of this value cannot be given
here, but it may be deduced from an important approximate
expression for the factorials of large numbers, due to James
Stirling (1730). If n be large, we have, to a high degree of
approximation,

$$n! = \sqrt{2n\pi}\;n^n e^{-n}.$$

The complete expression for the normal curve is therefore

$$y = \frac{N}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{x^2}{2\sigma^2}} \tag{5}$$

The exponent may be written $x^2/c^2$ where $c = \sqrt{2}\,\sigma$, and this is
the origin of the use of $\sqrt{2}\,\sigma$ (the "modulus") as a measure
of dispersion, of $1/(\sqrt{2}\,\sigma)$ as a measure of "precision," and of $2\sigma^2$
as "the fluctuation" (cf. Chap. VIII. § 13). The use of the factor
2 or $\sqrt{2}$ becomes meaningless if the distribution be not normal.

Another rule cited in Chap. VIII., viz. that the mean deviation
is approximately 4/5 of the standard-deviation, is strictly true
for the normal curve only. For this distribution the mean
deviation $= \sigma\sqrt{2/\pi} = 0.79788\ldots\sigma$: the proof cannot be given
within the limitations of the present work. The rule that a
range of 6 times the standard-deviation includes the great
majority of the observations and that the quartile deviation is
about 2/3 of the standard-deviation were also suggested by the
properties of this curve (see below §§ 16, 17).
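The two ratios quoted for the normal curve may be verified with Python's standard library. The sketch is an illustration only: the mean deviation is taken from the formula $\sigma\sqrt{2/\pi}$ given above, and the quartile deviation is found as the deviation cutting off three-quarters of the area, using `statistics.NormalDist`.

```python
from math import pi, sqrt
from statistics import NormalDist

sigma = 1.0
mean_deviation = sigma * sqrt(2 / pi)                     # about 4/5 of sigma
quartile_deviation = NormalDist(0, sigma).inv_cdf(0.75)   # about 2/3 of sigma
print(round(mean_deviation, 5), round(quartile_deviation, 4))
```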

12. In the proof of § 9 the assumption was made that k (the
half of the exponent of the binomial) was very large compared
with x (any deviation that had to be considered). In point
of fact, however, the normal curve gives the terms of the
symmetrical binomial surprisingly closely even for moderate
values of n. Thus if n = 64, k = 32, and the standard-deviation
is 4. Deviations x have therefore to be considered up to ±12
or more, which is over 1/3 of k. As will be seen, however, from
the annexed table, the ordinates of the normal curve agree with
those of the binomial to the nearest unit (in 10,000 observations)
up to x = ±15. The closeness of approximation is partly due
to the fact that, in applying the logarithmic series to the
fraction on the right of equation (3), the terms of the second
order in expansions of corresponding brackets in numerator and
denominator cancel each other: these terms, therefore, do not
accumulate, but only the terms of the third order. There is
only one second-order term that has been neglected, viz. that due
to the last bracket in the denominator. Even for much lower
values of n than that chosen for the illustration, e.g. 10 or 12
(cf. Qu. 4 at the end of this chapter), the normal curve still
gives a very fair approximation.
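The comparison for n = 64 can be reproduced directly. The sketch below is an illustration only (the function names are hypothetical); it computes the ordinates of $10{,}000\,(\tfrac{1}{2}+\tfrac{1}{2})^{64}$ and the corresponding ordinates of the normal curve with $\sigma = 4$ and $y_0 = N/(\sigma\sqrt{2\pi})$, for deviations up to 15 from the central term.

```python
from math import comb, exp, pi, sqrt

N, n = 10_000, 64
k, sigma = n // 2, sqrt(n) / 2           # k = 32, sigma = 4
y0 = N / (sigma * sqrt(2 * pi))          # central ordinate of the normal curve

def binomial_ordinate(m):
    """Term of N(1/2 + 1/2)^n for m successes."""
    return N * comb(n, m) / 2 ** n

def normal_ordinate(m):
    """Ordinate of the fitted normal curve at deviation x = m - k."""
    x = m - k
    return y0 * exp(-x * x / (2 * sigma ** 2))

for m in range(k, k + 16):
    print(m, round(binomial_ordinate(m)), round(normal_ordinate(m)))
```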



Table showing (1) Ordinates of the Binomial Series $10{,}000\,(\tfrac{1}{2} + \tfrac{1}{2})^{64}$ and
(2) Corresponding Ordinates of the Normal Curve $y = \frac{10{,}000}{4\sqrt{2\pi}}\,e^{-x^2/32}$.

              Binomial   Normal                  Binomial   Normal
  Term.       Series.    Curve.     Term.        Series.    Curve.

  32            993        997      24 and 40      136        135
  31 and 33     963        967      23 and 41       80         79
  30 and 34     878        880      22 and 42       44         44
  29 and 35     753        753      21 and 43       23         23
  28 and 36     606        605      20 and 44       11         11
  27 and 37     459        457      19 and 45        5          5
  26 and 38     326        324      18 and 46        2          2
  25 and 39     217        216      17 and 47        1          1



13. But if the normal curve were limited in its application to
distributions which were certainly of binomial type, its use in
practice (apart from its theoretical applications to many cases of
the theory of sampling) would be very restricted. As suggested,
however, by the illustrations given in Chap. VI., a certain, though
not a large, number of distributions, more particularly among
those relating to measurements on man and other animals, are
approximately of normal form, even although such distributions
have not obviously originated in the same way as a binomial
distribution. Take, for example, the distribution of statures in
the United Kingdom (Chap. VI., Table VI.). The mean stature
is 67.46 inches, the standard-deviation 2.57 inches (the values are
worked out in the illustrations of Chaps. VII. and VIII.), and the
number of observations 8585. This gives $y_0 = 1333$, and all the
data necessary for plotting a normal curve of the same mean and
standard-deviation (the process of fitting is dealt with at greater
length in § 14 below). The two distributions are shown together
in fig. 49, the continuous curve being the normal curve, and the
small circles showing the observed frequencies. It is evident that
they agree very closely. Other body measurements, e.g. skull
measurements, etc., also follow the normal law; it also applies to
certain characters in plants (e.g. number of seeds per capsule in
Lotus, Pearl, American Naturalist, Nov. 1906). The question
arises, therefore, why, in such cases, the distribution should be
approximately normal, a form of distribution which we have only
shown to arise if the variable is the sum of a large number of
elements, each of which can take the values 0 and 1 (or other two
constant values), these values occurring independently, and with
equal frequency.

In the first place, it should be stated that the conditions of the
deduction given in § 9 were made a little unnecessarily restricted,

























































[Fig. 49: horizontal axis, stature in inches, 56 to 80.]

Fig. 49. The Distribution of Stature for Adult Males in the British Isles
(fig. 8, p. 88), fitted with a Normal Curve: to avoid confusing the
figure, the frequency-polygon has not been drawn in, the tops of the
ordinates being shown by small circles.

with a view to securing simplicity of algebra. The deduction
may be generalised, whilst retaining the same type of proof, by
assuming that p and q are unequal (provided p - q be small
compared with $\sqrt{npq}$, cf. § 3), that p and q are not quite the
same for all the events, that all the events are not quite
independent, or that n is not large, but that some sort of continuous
variation is possible in the values of the elementary variables,
these being no longer restricted to 0 and 1, or two other discrete
values. (Cf. the deduction given by Pearson in ref. 11.)
Proceeding further from this last idea, the deduction may be rendered
more general still, without introducing the conception of the
binomial at all, by founding the curve on more or less complex
cases of the theory of sampling for variables instead of for
attributes. If a variable is the sum (or, within limits, some
slightly more complicated function) of a large number of other
variables, then the distribution of the compound or resultant
variable is normal, provided that the elementary variables are
independent, or nearly so. The forms of the frequency-distributions
of the elementary variables affect the final distribution less
and less as their number is increased: only if their number is
moderate, and the distributions all exhibit a comparatively high
degree of asymmetry of uniform sign, will the same sign of
asymmetry be sensibly evident in the distribution of the compound
variable. On this sort of hypothesis, the expectation of normality
in the case of stature may be based on the fact that it is a highly
compound character, depending on the sizes of the bones of the
head, the vertebral column, and the legs, the thickness of the
intervening cartilage, and the curvature of the spine, the elements
of which it is composed being at least to some extent independent,
i.e. by no means perfectly correlated with each other, and their
frequency-distributions exhibiting no very high degree of asymmetry
of one and the same sign. The comparative rarity of
normal distributions in economic statistics is probably due in part
to the fact that in most cases, while the entire causation is
certainly complex, relatively few causes have a largely predominant
influence (hence also the frequent occurrence of irregular
distributions in this field of work), and in part also to a high
degree of asymmetry in the distributions of the elements on which
the compound variable depends. Errors of observation may in
general be regarded as compounded of a number of elements, due
to various causes, and it was in this connection that the normal
curve was first deduced, and received its name of the curve of
errors, or law of error.
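The statement that the asymmetry of the elementary variables matters less and less as their number is increased can be illustrated by exact calculation. The sketch below is an illustration under stated assumptions, not a method from the text: the elementary distribution `element` is hypothetical, the elements are taken as strictly independent and identically distributed, and asymmetry is measured by the third moment about the mean divided by the cube of the standard-deviation (a measure chosen for the example).

```python
def convolve(a, b):
    """Distribution of the sum of two independent variables whose
    distributions over 0, 1, 2, ... are a and b."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out

def skewness(dist):
    """Third moment about the mean divided by the cube of the standard-deviation."""
    mean = sum(x * px for x, px in enumerate(dist))
    m2 = sum((x - mean) ** 2 * px for x, px in enumerate(dist))
    m3 = sum((x - mean) ** 3 * px for x, px in enumerate(dist))
    return m3 / m2 ** 1.5

element = [0.6, 0.25, 0.1, 0.05]     # hypothetical asymmetrical element
total = element
for count in range(2, 17):
    total = convolve(total, element)
    if count in (4, 16):
        print(count, round(skewness(total), 3))
```

For independent and identical elements the measure falls off as the reciprocal of the square root of their number, so that sixteen elements show one quarter of the asymmetry of a single one.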

14. If it be desired to compare some actual distribution
with the normal distribution, the two distributions should be
superposed on one diagram, as in fig. 49, though, of course, on
a much larger scale. When the mean and standard-deviation
of the actual distribution have been determined, $y_0$ is given by
equation (5); the fit will probably be slightly closer if the
standard-deviation is adjusted by Sheppard's correction (Chap.
XI. § 3). The normal curve is then most readily drawn by plotting
a scale showing fifths of the standard-deviation along the
base line of the frequency diagram, taking the mean as origin,
and marking over these points the ordinates given by the figures
of the table on p. 299, multiplied in each case by $y_0$. The curve
can be drawn freehand, or by aid of a curve ruler, through the
tops of the ordinates so determined. The logarithms of y in the
table on p. 299 are given to facilitate the multiplication. The only
point in which the student is likely to find any difficulty is
in the use of the scales: he must be careful to remember
that the standard-deviation must be expressed in terms of the
class-interval as a unit in order to obtain for $y_0$ a number of
observations per interval comparable with the frequencies of his
table.

The process may be varied by keeping the normal curve
drawn to one scale, and redrawing the actual distribution
so as to make the area, mean, and standard-deviation the
same. Thus suppose a diagram of a normal curve was printed
once for all to a scale, say, of $y_0 = 5$ inches, $\sigma = 1$ inch, and
it were required to fit the distribution of stature to it.
Since the standard-deviation is 2.57 inches of stature, the
scale of stature is 1 inch = 2.57 inches of stature, or 0.389 inches
= 1 inch of stature; this scale must be drawn on the base of the
normal-curve diagram, being so placed that the mean falls
at 67.46. As regards the scale of frequency-per-interval, this
is given by the fact that the whole area of the polygon showing
the actual distribution must be equal to the area of the
normal curve, that is $5\sqrt{2\pi} = 12.53$ square inches. If, therefore,
the scale required is n observations per interval to the inch,
we have, the number of observations being 8585,

$$\frac{8585}{n \times 2.57} = 12.53,$$

which gives n = 266.6.

Though the second method saves curve drawing, the first,
on the whole, involves the least arithmetic and the simplest
plotting.
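Both methods of fitting can be followed through for the stature example. The sketch below is an illustration only (the variable names are hypothetical); it takes the figures quoted in the text, computes $y_0$ from equation (5) with the class-interval of 1 inch as unit, lists ordinates at fifths of the standard-deviation, and finds the frequency scale for the second method.

```python
from math import exp, pi, sqrt

N, mean, sd = 8585, 67.46, 2.57          # figures quoted in the text
y0 = N / (sd * sqrt(2 * pi))             # equation (5), class-interval of 1 inch

# First method: ordinates at fifths of the standard-deviation from the mean,
# as (stature in inches, ordinate) pairs out to three standard-deviations.
ordinates = [(mean + i * sd / 5, y0 * exp(-((i / 5) ** 2) / 2))
             for i in range(-15, 16)]

# Second method: printed curve with y0 = 5 inches and sigma = 1 inch.
printed_area = 5 * sqrt(2 * pi)          # area in square inches
scale = N / (sd * printed_area)          # observations per interval to the inch
print(round(y0), round(printed_area, 2), round(scale, 1))
```

The scale comes out very slightly below the 266.6 of the text, because the text rounds the area to 12.53 before dividing.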

15. Any plotting of a diagram, or the equivalent arithmetical
comparison of actual frequencies with those given by the
fitted normal distribution, affords, of course, in itself, only a
rough test, of a practical kind, of the normality of the given
distribution. The question whether all the observed differences
between actual and calculated frequencies, taken together,
may have arisen merely as fluctuations of sampling, so that the
actual distribution may be regarded as strictly normal, neglecting
such errors, is a question of a kind that cannot be answered in
an elementary work (cf. ref. 19). At present the student is in
a position to compare the divergences of actual from calculated
frequencies with fluctuations of sampling in the case of single
class-intervals, or single groups of class-intervals only. If the
expected theoretical frequency in a certain interval is f, the
standard error of sampling is $\sqrt{f(N-f)/N}$; and if the divergence
of the observed from the theoretical frequency exceed some
three times this standard error, the divergence is unlikely to
have occurred as a mere fluctuation of sampling.
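The test for a single class-interval can be sketched as follows. The function names and the observed frequencies used in the example are hypothetical, chosen for illustration only; the standard error is the expression $\sqrt{f(N-f)/N}$ given above.

```python
from math import sqrt

def sampling_standard_error(f, N):
    """Standard error of a theoretical class frequency f out of N observations."""
    return sqrt(f * (N - f) / N)

def exceeds_three_standard_errors(observed, f, N):
    """True if the divergence is more than three times the standard error."""
    return abs(observed - f) > 3 * sampling_standard_error(f, N)

# Hypothetical figures, for illustration only.
print(round(sampling_standard_error(1333, 8585), 1))
print(exceeds_three_standard_errors(1400, 1333, 8585))
```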

It should be noted, however, that the ordinate of the normal
curve at the middle of an interval does not give accurately the
area of that interval, or the number of observations within it: it
would only do so if the curve were sensibly straight. To deal
strictly with problems as to fluctuations of sampling in the
frequencies of single intervals or groups of intervals, we require,
accordingly, some convenient means of obtaining the number of
observations, in a given normal distribution, lying between any
two values of the variable.

16. If an ordinate be erected at a distance x/σ from the mean, 
in a normal curve, it divides the whole area into two parts, the 
ratio of which is evidently, from the mode of construction of the 
curve, independent of the values of y₀ and of σ. The calculation 
of these fractions of area for given values of x/σ, though a long 
and tedious matter, can thus be done once for all, and a table 
giving the results is useful for the purpose suggested in § 15 and 
in many other ways. References to complete tables are cited at 
the end of this work (list of tables, pp. 353-4), the short table below 
being given only for illustrative purposes. The table shows the 
greater fraction of the area lying on one side of any given ordinate ; 
e.g. 0.53983 of the whole area lies on one side of an ordinate at 
0.1σ from the mean, and 0.46017 on the other side. It will be 
seen that an ordinate drawn at a distance from the mean equal to 
the standard-deviation cuts off some 16 per cent. of the whole 
area on one side ; some 68 per cent. of the area will therefore be 
contained between ordinates at ±σ. An ordinate at twice the 
standard-deviation cuts off only 2.3 per cent., and therefore some 
95.4 per cent. of the whole area lies within a range of ±2σ. At 
three times the standard-deviation the fraction of area cut off is 
reduced to 135 parts in 100,000, leaving 99.7 per cent. within a 
range of ±3σ. This is the basis of our rough rule that a range 
of 6 times the standard-deviation will in general include the 
great bulk of the observations : the rule is founded on, and is only 
strictly true for, the normal distribution. For other forms of 
distribution it need not hold good, though experience suggests 
that it more often holds than not. The binomial distribution, 
especially if p and q be unequal, only becomes approximately normal 
when n is large, and this limitation must be remembered in applying 
the table given, or similar more complete tables, to cases in which 
the distribution is strictly binomial. 
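
The fractions of area quoted above may be recomputed rather than read from a table. The following is a minimal sketch using the error function of the Python standard library; the function names are chosen here for illustration only.

```python
import math

def greater_fraction(z):
    """Fraction of the normal-curve area on the larger side of an ordinate
    at a distance z standard-deviations from the mean (z >= 0)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fraction_within(z):
    """Fraction of the area between ordinates at -z and +z."""
    return 2.0 * greater_fraction(z) - 1.0

for z in (0.1, 1.0, 2.0, 3.0):
    print(z, round(greater_fraction(z), 5), round(fraction_within(z), 4))
```

The values at 1, 2 and 3 times the standard-deviation give the 68, 95.4 and 99.7 per cent. of the rough rule.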






THEORY OF STATISTICS. 



Table showing the Greater Fraction of the Area of a Normal Curve to One 
Side of an Ordinate of Abscissa x/σ. (For references to more extended 
tables, see list on pp. 353-4.) 

          Greater                    Greater 
  x/σ.    Fraction of       x/σ.     Fraction of 
          Area.                      Area. 

  0       0.50000           2.1      0.98214 
  0.1     0.53983           2.2      0.98610 
  0.2     0.57926           2.3      0.98928 
  0.3     0.61791           2.4      0.99180 
  0.4     0.65542           2.5      0.99379 
  0.5     0.69146           2.6      0.99534 
  0.6     0.72575           2.7      0.99653 
  0.7     0.75804           2.8      0.99744 
  0.8     0.78814           2.9      0.99813 
  0.9     0.81594           3.0      0.99865 
  1.0     0.84134           3.1      0.99903 
  1.1     0.86433           3.2      0.99931 
  1.2     0.88493           3.3      0.99952 
  1.3     0.90320           3.4      0.99966 
  1.4     0.91924           3.5      0.99977 
  1.5     0.93319           3.6      0.99984 
  1.6     0.94520           3.7      0.99989 
  1.7     0.95543           3.8      0.99993 
  1.8     0.96407           3.9      0.99995 
  1.9     0.97128           4.0      0.99997 
  2.0     0.97725           4.1      0.99998 



Interpolating, it is given approximately by 

$$\left(0.6 + 0.1 \times \frac{2425}{3229}\right)\sigma = 0.675\,\sigma.$$



More exact interpolation gives the value 0.67448975σ. This result, 
again, is the foundation of the rough rule that the semi-interquartile 
range is usually some 2/3 of the standard-deviation : it is 
strictly true for the normal curve only. It may be noted that 
the constant 0.67448975 . . . can be determined by processes of 
interpolation only, and cannot be expressed exactly, like the 
mean deviation, in terms of any other known constant, such as π. 

It has become customary to use 0.674 . . . times the standard 
error rather than the standard error itself as a measure of the 
unreliability of observed statistical results, and the term probable 
error is given to this quantity. It should be noted that the word 
" probable " is hardly used in its usual sense in this connection : 
the probable error is merely a quantity such that we may expect 
greater and less errors of simple sampling with about equal 
frequency, provided always that the distribution of errors is 
normal. On the whole, the use of the " probable error " has little 
advantage compared with the standard, and consequently little 
stress is laid on it in the present work ; but the term is in constant 
use, and the student must be familiar with it. 
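
The interpolation for the constant 0.674 . . . may be checked as follows. The sketch takes the two tabulated areas at x/σ = 0.6 and 0.7 and, for comparison, the inverse of the normal integral as supplied by `statistics.NormalDist` in the Python standard library; the variable names are illustrative only.

```python
from statistics import NormalDist

# Tabulated greater fractions of area at x/sigma = 0.6 and 0.7.
area_06, area_07 = 0.72575, 0.75804

# Linear interpolation for the deviation below which 0.75 of the area lies.
interpolated = 0.6 + 0.1 * (0.75 - area_06) / (area_07 - area_06)

# Value from the inverse normal integral, for comparison.
exact = NormalDist().inv_cdf(0.75)

print(round(interpolated, 3))
print(round(exact, 8))
```

The numerator and denominator of the interpolating fraction, 0.02425 and 0.03229, are the figures 2425 and 3229 of the text.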

It is true that the "probable error" has a simpler and more direct 
significance than the standard error, but this advantage is lost as 
soon as we come to deal with multiples of the probable error. 
Further, the best modern tables of the ordinates and area of the 
normal curve are given in terms of the standard-deviation or 
standard error, not in terms of the probable error, and the 
multiplication of the former by 0.6745, to obtain the probable error, 
is not justified unless the distribution is normal. For very large 
samples the distribution is approximately normal, even though p 
and q are unequal ; but this is not so for small samples, such as 
often occur in practice. In the case of small samples the use of 
the "probable error" is consequently of doubtful value, while the 
standard error retains its significance as a measure of dispersion. 
The " probable error," it may be mentioned, is often stated after 
an observed proportion with the ± sign before it ; a percentage 
given as 20.5 ± 2.3 signifying "20.5 per cent., with a probable 
error of 2.3 per cent." 

If an error or deviation in, say, a certain proportion p only just 
exceed the probable error, it is as likely as not to occur in simple 
sampling : if it exceed twice the probable error (in either direction), 
it is likely to occur as a deviation of simple sampling about 18 
times in 100 trials, or the odds are about 4.6 to 1 against its 
occurring at any one trial. For a range of three times the probable 
error the odds are about 22 to 1, and for a range of four times the 
probable error 142 to 1. Until a deviation exceeds, then, 4 times 
the probable error, we cannot feel any great confidence that it is 
likely to be "significant." It is simpler to work with the standard 
error and take ±3 times the standard error as the critical range : 
for this range the odds are about 370 to 1 against such a deviation 
occurring in simple sampling at any one trial. 
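
The odds quoted in this paragraph follow from the areas of the normal curve, and may be recomputed by a short sketch such as the following. It assumes, as the text does, that the errors are normally distributed; the function name is introduced here for illustration.

```python
from statistics import NormalDist

PROBABLE_ERROR = 0.6745  # probable error in units of the standard error

def odds_against(z):
    """Odds against a deviation exceeding z standard errors in either
    direction, assuming normally distributed errors."""
    p = 2.0 * (1.0 - NormalDist().cdf(z))  # chance of exceeding +/- z
    return (1.0 - p) / p

for k in (2, 3, 4):
    print(k, "x probable error:", round(odds_against(k * PROBABLE_ERROR), 1))
print("3 x standard error:", round(odds_against(3.0)))
```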

18. The following are a few miscellaneous examples of the use 
of the normal curve and the table of areas. 

Example i. A hundred coins are thrown a number of times. 
How often approximately in 10,000 throws may (1) exactly 65 
heads, (2) 65 heads or more, be expected ? 




The standard-deviation is √(0.5 × 0.5 × 100) = 5. Taking the 
distribution as normal, y₀ = 797.9. 

The mean number of heads being 50, 65 − 50 = 3σ. The 
frequency of a deviation of 3σ is given at once by the table (p. 299) 
as 797.9 × .0111 . . . = 8.86, or nearly 9 throws in 10,000. A 
throw of 65 heads will therefore be expected about 9 times. 

The frequency of throws of 65 heads or more is given by the 
area table (p. 306), but a little caution must now be used, owing 
to the discontinuity of the distribution. A throw of 65 heads is 
equivalent to a range of 64.5-65.5 on the continuous scale of the 
normal curve, the division between 64 and 65 coming at 64.5. 
64.5 − 50 = +2.9σ, and a deviation of +2.9σ or more will only 
occur, as given by the table, 187 times in 100,000 throws, or, say, 
19 times in 10,000. 
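
The two answers of Example i. may be recomputed as below. The sketch follows the method of the text (central ordinate for the single value, area beyond 64.5 for "65 or more"); the exact binomial frequencies at the end are an addition made here for comparison and are not part of the original working.

```python
import math
from statistics import NormalDist

throws, n, p = 10_000, 100, 0.5
mean = n * p
sd = math.sqrt(n * p * (1 - p))                 # 5
y0 = throws / (sd * math.sqrt(2 * math.pi))     # central ordinate, about 797.9

# (1) exactly 65 heads: ordinate at a deviation of 3 sd
exactly_65 = y0 * math.exp(-((65 - mean) / sd) ** 2 / 2)

# (2) 65 heads or more: area beyond 64.5 on the continuous scale
at_least_65 = throws * (1 - NormalDist(mean, sd).cdf(64.5))

# Exact binomial values, for comparison only (not in the original working)
binom = [math.comb(n, k) / 2 ** n for k in range(n + 1)]
exact_65 = throws * binom[65]
exact_at_least_65 = throws * sum(binom[65:])

print(round(exactly_65, 2), round(at_least_65, 1))
print(round(exact_65, 2), round(exact_at_least_65, 1))
```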

Example ii. Taking the data of the stature-distribution of fig. 
49 (mean 67.46, standard-deviation 2.57 in.), what proportion of 
all the individuals will be within a range of ± 1 inch of the 
mean ? 

1 inch = 0.389σ. Simple interpolation in the table of p. 306 
gives 0.65129 of the area below this deviation, or a more extended 
table the more accurate value 0.65136. Within a range of 
±0.389σ the fraction of the whole area is therefore 0.30272, or the 
statures of about 303 per thousand of the given population will lie 
within a range of ± 1 inch from the mean. 
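
The working of Example ii. can be followed step by step in a short sketch. The tabulated areas at 0.3 and 0.4 are those of the short table; the more accurate value is here taken from `statistics.NormalDist`, which stands in for the "more extended table" of the text.

```python
from statistics import NormalDist

sd = 2.57            # standard-deviation of stature, inches
z = 1.0 / sd         # 1 inch in units of the standard-deviation

# Linear interpolation between the tabulated areas at 0.3 and 0.4
z3 = round(z, 3)     # 0.389, as used in the text
interpolated = 0.61791 + (z3 - 0.3) / 0.1 * (0.65542 - 0.61791)

exact = NormalDist().cdf(z3)
within = 2 * exact - 1   # fraction within +/- 1 inch of the mean

print(round(interpolated, 5), round(exact, 5), round(within, 4))
```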

Example iii. In a case of crossing a Mendelian recessive by a 
heterozygote the expectation of recessive offspring is 50 per cent. 
(1) How often would 30 recessives or more be expected amongst 50 
offspring owing simply to fluctuations of sampling ? (2) How many 
offspring would have to be obtained in order to reduce the probable 
error to 1 per cent. ? 

The standard error of the percentage of recessives for 50 
observations is 50√(1/50) = 7.07. Thirty recessives in fifty is 
a deviation of 5 from the mean, or, if we take thirty as representing 
29.5 or more, 4.5 from the mean; that is, 9 per cent., or 1.273σ. A 
positive deviation of this amount or more occurs about 102 times in 
1000, so that 30 recessives or more would be expected in about 
one-tenth of the batches of 50 offspring. We have assumed 
normality for rather a small value of n, but the result is sufficiently 
accurate for practical purposes. 

As regards the second part of the question we are to have 

$$0.6745 \times 50\sqrt{1/n} = 1,$$

n being the number of offspring. This gives n = 1137 to the 
nearest unit. 
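
The second part of Example iii. is a simple inversion of the formula for the standard error of a percentage, and may be sketched thus. The function names are hypothetical, and the factor 0.6745 is the rounded value used in the text.

```python
import math

PROBABLE_ERROR = 0.6745   # probable error as a multiple of the standard error

def standard_error_percent(n, p=0.5):
    """Standard error, in per cent., of an observed percentage based on n."""
    return 100 * math.sqrt(p * (1 - p) / n)

def offspring_needed(target_probable_error, p=0.5):
    """Number n for which the probable error of the percentage equals the target."""
    return (PROBABLE_ERROR * 100 * math.sqrt(p * (1 - p)) / target_probable_error) ** 2

print(round(standard_error_percent(50), 2))
print(round(offspring_needed(1.0)))
```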




Example iv. The diagram of fig. 49 shows that the number of 
statures recorded in the group " 62 in. and less than 63 " is 
markedly less than the theoretical value. Could such a difference 
occur owing to fluctuations of simple sampling; and if so, how 
often might it happen ? 

The actual frequency recorded is 169. To obtain the theoretical 
frequency we may either take it as given roughly by the 
ordinate in the centre of the interval, or, better, use the integral 
table. Remembering that statures were only recorded to the 
nearest ⅛ in., the true limits of the interval are 61 15/16 to 62 15/16, or 
61.94-62.94, mid-value 62.44. This is a deviation from the 
mean (67.46) of 5.02. Calculating the ordinate of the normal 
curve directly we find the frequency 197.8. This is certainly, as 
is evident from the form of the curve, a little too small. The 
interval actually lies between deviations of 4.52 in. and 5.52 
in., that is, 1.759σ and 2.148σ. The corresponding fractions of 
area are 0.96071 and 0.98418, difference, or fraction of area 
between the two ordinates, 0.02347. Multiplying this by the 
whole number of observations (8585) we have the theoretical 
frequency 201.5. 

The difference of theoretical and observed frequencies is therefore 
32.5. But the proportion of observations which should fall into 
the given class is 0.023, the proportion falling into other classes 
0.977, and the standard error of the class frequency is accordingly 
√(0.023 × 0.977 × 8585) = 14.0. As the actual deviation is only 
2.32 times this, it could certainly have occurred as a fluctuation of 
sampling. 

The question how often it might have occurred can only be 
answered if we assume the distribution of fluctuations of sampling 
to be approximately normal. It is true that p and q are very 
unequal, but then n is very large (8585), so large that the 
difference of the chances is fairly small compared with √(npq) 
(about one-fifteenth). Hence we may take the distribution of 
errors as roughly normal to a first approximation, though a 
first approximation only. The tables give 0.990 of the area 
below a deviation of 2.32σ, so we would expect an equal or 
greater deficiency to occur about 10 times in 1000 trials, or once 
in a hundred. 
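
The whole of Example iv. may be retraced numerically. The sketch below assumes the constants of the text (N = 8585, mean 67.46, standard-deviation 2.57 in., class limits 61.94-62.94) and uses `statistics.NormalDist` in place of the integral table; since the deviations are not rounded to three figures, the results differ from the printed ones in the last place.

```python
import math
from statistics import NormalDist

n_obs, mean, sd = 8585, 67.46, 2.57
observed = 169
lower, upper = 61.94, 62.94          # true limits of the class, inches

stature = NormalDist(mean, sd)

# Rough value: central ordinate times the (unit) class width
by_ordinate = n_obs * stature.pdf((lower + upper) / 2) * (upper - lower)

# Better value: area of the curve between the class limits
fraction = stature.cdf(upper) - stature.cdf(lower)
by_area = n_obs * fraction

std_error = math.sqrt(fraction * (1 - fraction) * n_obs)
ratio = (by_area - observed) / std_error

# Chance of an equal or greater deficiency, errors taken as normal
chance = NormalDist().cdf(-ratio)

print(round(by_ordinate, 1), round(by_area, 1))
print(round(std_error, 1), round(ratio, 2), round(chance, 3))
```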

REFERENCES. 

The Binomial Machine. 

(1) Galton, Francis, Natural Inheritance ; Macmillan & Co., London, 1889. 
(Mechanical method of forming a binomial or normal distribution, 
chap. v., p. 63 ; for Pearson's generalised machine, see below, 
ref. 11.) 

Frequency Curves. 

For the early classical memoirs on the normal curve or law of error 
by Laplace, Gauss, and others, see Todhunter's History (Introduction : 
ref. 7). The literature of this subject is too extensive to enable us to do 
more than cite a few of the more recent memoirs, of which 5, 6, and 11 
are of fundamental importance. The student will find other citations 
in 3, 7, and 12. 

(2) Charlier, C. V. L., "Researches into the Theory of Probability" 
(Communications from the Astronomical Observatory, Lund) ; Lund, 1906. 

(3) Edgeworth, F. Y., "On the Representation of Statistics by Mathematical 
Formulæ," Jour. Roy. Stat. Soc., vol. lxi., 1898 ; vol. lxii., 1899 ; 
and vol. lxiii., 1900. 

(4) Edgeworth, F. Y., Article on the "Law of Error" in the Encyclopædia 
Britannica, 10th edn., vol. xxviii., 1902, p. 280. 

(5) Edgeworth, F. Y., "The Law of Error," Cambridge Phil. Trans., vol. 
xx., 1904, pp. 36-65, 113-141 (and an appendix, pp. i-xiv, not 
printed in the Cambridge Phil. Trans.). 

(6) Edgeworth, F. Y., "The Generalised Law of Error, or Law of Great 
Numbers," Jour. Roy. Stat. Soc., vol. lxix., 1906, p. 497. 

(7) Edgeworth, F. Y., "On the Representation of Statistical Frequency by 
a Curve," Jour. Roy. Stat. Soc., vol. lxx., 1907, p. 102. 

(8) Fechner, G. T., Kollektivmasslehre (herausgegeben von G. F. Lipps) ; 
Engelmann, Leipzig, 1897. 

(9) Kapteyn, J. C., Skew Frequency Curves in Biology and Statistics ; 
Noordhoff, Groningen ; Wm. Dawson & Sons, London, 1903. 

(10) Macalister, Donald, "The Law of the Geometric Mean," Proc. Roy. 
Soc., vol. xxix., 1879, p. 367. 

(11) Pearson, Karl, "Skew Variation in Homogeneous Material," Phil. 
Trans. Roy. Soc., Series A, vol. clxxxvi., 1895, p. 343. 
(For the generalised binomial machine, see § 1. The memoir deals 
with curves derived from the general binomial, and from a somewhat 
analogous series derived from the case of sampling from limited 
material. Supplement to the memoir, ibid., vol. cxcvii., 1901, p. 443. 
For a derivation of the same curves from a modified standpoint, 
ignoring the binomial and analogous distributions, cf. chap. x., ref. 12.) 

(12) Pearson, Karl, "Das Fehlergesetz und seine Verallgemeinerungen 
durch Fechner und Pearson" : A Rejoinder, Biometrika, vol. iv., 1905, 
p. 169. 

(13) Sheppard, W. F., "On the Application of the Theory of Error to Cases 
of Normal Distribution and Normal Correlation," Phil. Trans. Roy. 
Soc., Series A, vol. cxcii., 1898, p. 101. (Includes a geometrical 
treatment of the normal curve.) 

(14) Yule, G. U., "On the Distribution of Deaths with Age when the Causes 
of Death act cumulatively, and similar Frequency-distributions," 
Jour. Roy. Stat. Soc., vol. lxxiii., 1910, p. 26. (A binomial 
distribution with negative index, and the related curve, i.e. a special 
case of one of Pearson's curves, ref. 11.) 

The Resolution of a Distribution Compounded of two Normal 
Curves into its Components. 

(15) Pearson, Karl, "Contributions to the Mathematical Theory of Evolution 
(on the Dissection of Asymmetrical Frequency Curves)," Phil. 
Trans. Roy. Soc., Series A, vol. clxxxv., 1894, p. 71. 

(16) Edgeworth, F. Y., "On the Representation of Statistics by Mathematical 
Formulæ," part ii., Jour. Roy. Stat. Soc., vol. lxii., 1899, p. 125. 

(17) Pearson, Karl, "On some Applications of the Theory of Chance to 
Racial Differentiation," Phil. Mag., 6th Series, vol. i., 1901, p. 110. 

(18) Helguero, Fernando de, "Per la risoluzione delle curve dimorfiche," 
Biometrika, vol. iv., 1905, p. 230. Also memoir under the same title 
in the Transactions of the Reale Accademia dei Lincei, Rome, vol. vi., 
1906. (The first is a short note, the second the full memoir.) 

See also the memoir by Charlier, cited in (2), section vi. of that 
memoir dealing with the problem of dissection. 

Testing the Fit of a Theoretical to an Observed Distribution. 

(19) Pearson, Karl, "On the Criterion that a given System of Deviations 
from the Probable, in the case of a Correlated System of Variables, is 
such that it can be reasonably supposed to have arisen from random 
sampling," Phil. Mag., July 1900, p. 157. 



EXERCISES. 

1. Calculate the theoretical distributions for the three experimental cases 
(1), (2), and (3) cited in § 7 of Chapter XIII. 

2. Show that if np be a whole number, the mean of the binomial coincides 
with the greatest term. 

3. Show that if two symmetrical binomial distributions of degree n (and of 
the same number of observations) are so superposed that the rth term of 
the one coincides with the (r + 1)th term of the other, the distribution 
formed by adding superposed terms is a symmetrical binomial of degree n + 1. 

[Note : it follows that if two normal distributions of the same area and 
standard-deviation are superposed so that the difference between the means is 
small compared with the standard-deviation, the compound curve is very 
nearly normal.] 

4. Calculate the ordinates of the binomial 1024 (0.5 + 0.5)^10, and compare 
them with those of the normal curve. 

5. Draw a diagram showing the distribution of statures of Cambridge 
Students (Chap. VII., Table VII.), and a normal curve of the same area, 
mean, and standard-deviation superposed thereon. 

6. Compare the values of the semi-interquartile range for the stature 
distributions of male adults in the United Kingdom and Cambridge Students, 
(1) as found directly, (2) as calculated from the standard deviation, on the 
assumption that the distribution is normal. 

7. Taking the mean stature for the British Isles as 67.46 in. (the 
distribution of fig. 49), the mean for Cambridge students as 68.85 in., and the 
common standard-deviation as 2.56 in., what percentage of Cambridge students 
exceed the British mean in stature, assuming the distribution normal ? 

8. As stated in Chap. XIII., Example ii., certain crosses of Pisum sativum 
based on 7125 seeds gave 25.32 per cent. of green seeds instead of the theoretical 
proportion 25 per cent., the standard error being 0.51 per cent. In what 
percentage of experiments based on the same number of seeds might an equal or 
greater percentage be expected to occur owing to fluctuations of sampling ? 

9. In what proportion of similar experiments based on (1) 100 seeds, (2) 
1000 seeds, might (a) 30 per cent. or more, (b) 35 per cent. or more, of green 
seeds, be expected to occur, if ever ? 

10. In similar experiments, what number of seeds must be obtained to 
make the " probable error" of the proportion 1 per cent. ? 

11. If skulls are classified as dolichocephalic when the length-breadth 
index is under 75, mesocephalic when the same index lies between 75 and 80, 
and brachycephalic when the index is over 80, find approximately (assuming 
that the distribution is normal) the mean and standard-deviation of a series 
in which 58 per cent. are stated to be dolichocephalic, 38 per cent. 
mesocephalic, and 4 per cent. brachycephalic. 






CHAPTER XVI. 

NORMAL CORRELATION. 

1-3. Deduction of the general expression for the normal correlation surface 
from the case of independence. 4. Constancy of the standard-deviations 
of parallel arrays and linearity of the regression. 5. The 
contour lines: a series of concentric and similar ellipses. 6. The 
normal surface for two correlated variables regarded as a normal 
surface for uncorrelated variables rotated with respect to the axes of 
measurement : arrays taken at any angle across the surface are normal 
distributions with constant standard-deviation ; distribution of and 
correlation between linear functions of two normally correlated 
variables are normal : principal axes. 7. Standard-deviations round 
the principal axes. 8-11. Investigation of Table III., Chap. IX., to 
test normality : linearity of regression, constancy of standard-deviation 
of arrays, normality of distribution obtained by diagonal addition, 
contour lines. 12-13. Isotropy of the normal distribution for two 
variables. 14. Outline of the principal properties of the normal 
distribution for n variables. 

1. The expression that we have obtained for the " normal " 
distribution of a single variable may readily be made to yield a 
corresponding expression for the distribution of frequency of pairs 
of values of two variables. This normal distribution for two 
variables, or "normal correlation surface," is of great historical 
importance, as the earlier work on correlation is, almost without 
exception, based on the assumption of such a distribution ; 
though when it was recognised that the properties of the 
correlation-coefficient could be deduced, as in Chap. IX., without reference 
to the form of the distribution of frequency, a knowledge of 
this special type of frequency-surface ceased to be so essential. 
But the generalised normal law is of importance in the theory of 
sampling : it serves to describe very approximately certain actual 
distributions (e.g. of measurements on man) ; and if it can be 
assumed to hold good, some of the expressions in the theory of 
correlation, notably the standard-deviations of arrays (and, if 
more than two variables are involved, the partial 
correlation-coefficients), can be assigned more simple and definite meanings 
than in the general case. The student should, therefore, be 
familiar with the more fundamental properties of the distribution. 




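2. Consider first the case in which the two variables are 
completely independent. Let the distributions of frequency for the 
two variables x₁ and x₂, singly, be 

$$y_1 = y_1'\,e^{-x_1^2/2\sigma_1^2}, \qquad y_2 = y_2'\,e^{-x_2^2/2\sigma_2^2} \qquad (1)$$

Then, assuming independence, the frequency-distribution of pairs 
of values must, by the rule of independence, be given by 

$$y_{12} = y_{12}'\,e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)} \qquad (2)$$

where 

$$y_{12}' = \frac{N}{2\pi\,\sigma_1\sigma_2} \qquad (3)$$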

Equation (2) gives a normal correlation surface for one special 
case, the correlation-coefficient being zero. If we put x₂ = a 
constant, we see that every section of the surface by a vertical plane 
parallel to the x₁ axis, i.e. the distribution of any array of x₁'s, is 
a normal distribution, with the same mean and standard-deviation 
as the total distribution of x₁'s, and a similar statement holds for 
the array of x₂'s ; these properties must hold good, of course, as 
the two variables are assumed independent (cf. Chap. V. § 13). 
The contour lines of the surface, that is to say, lines drawn on 
the surface at a constant height, are a series of similar ellipses 
with major and minor axes parallel to the axes of x₁ and x₂ and 
proportional to σ₁ and σ₂, the equations to the contour lines being 
of the general form 

$$\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} = C^2 \qquad (4)$$

Pairs of values of x₁ and x₂ related by an equation of this form 
are, therefore, equally frequent. 

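3. To pass from this special case of independence to the general 
case of two correlated variables, remember (Chap. XII. § 8) 
that if 

$$x_{1.2} = x_1 - b_{12}\,x_2, \qquad x_{2.1} = x_2 - b_{21}\,x_1 \qquad (5)$$

x₁ and x₂.₁, as also x₂ and x₁.₂, are uncorrelated. If they are not 
merely uncorrelated but completely independent, and if the 
distribution of each of the deviations singly be normal, we must have 
for the frequency-distribution of pairs of deviations of x₁ and x₂.₁ 

$$y_{12} = y_{12}'\,e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_{2.1}^2}{\sigma_{2.1}^2}\right)} \qquad (6)$$

But, since b₂₁ = r₁₂σ₂/σ₁ and σ₂.₁² = σ₂²(1 − r₁₂²), 

$$\frac{x_1^2}{\sigma_1^2} + \frac{x_{2.1}^2}{\sigma_{2.1}^2} = \frac{x_1^2}{\sigma_1^2(1-r_{12}^2)} + \frac{x_2^2}{\sigma_2^2(1-r_{12}^2)} - \frac{2\,r_{12}\,x_1 x_2}{\sigma_1\sigma_2(1-r_{12}^2)}$$

Evidently we would also have arrived at precisely the same 
expression if we had taken the distribution of frequency for x₂ 
and x₁.₂, and reduced the exponent 

$$\frac{x_2^2}{\sigma_2^2} + \frac{x_{1.2}^2}{\sigma_{1.2}^2}$$

Further, since x₁ and x₂.₁, x₂ and x₁.₂, are independent, we must 
have 

$$y_{12}' = \frac{N}{2\pi\,\sigma_1\sigma_{2.1}} = \frac{N}{2\pi\,\sigma_2\sigma_{1.2}} = \frac{N}{2\pi\,\sigma_1\sigma_2(1-r_{12}^2)^{1/2}} \qquad (7)$$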
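
The reduction of the exponent may be verified numerically. The sketch below evaluates the exponent in its two forms, first in terms of x₁ and the residual x₂.₁, and secondly in terms of x₁ and x₂ directly; the function names and the trial values are arbitrary choices made here for illustration.

```python
import math

def exponent_via_residual(x1, x2, s1, s2, r):
    """x1^2/s1^2 + x2.1^2/s2.1^2, with x2.1 = x2 - b21*x1."""
    b21 = r * s2 / s1
    s21 = s2 * math.sqrt(1 - r * r)
    return x1**2 / s1**2 + (x2 - b21 * x1) ** 2 / s21**2

def exponent_expanded(x1, x2, s1, s2, r):
    """The same quantity written in terms of x1 and x2 directly."""
    k = 1 - r * r
    return (x1**2 / (s1**2 * k) + x2**2 / (s2**2 * k)
            - 2 * r * x1 * x2 / (s1 * s2 * k))

# Arbitrary illustrative values (not taken from the text)
args = (1.3, -0.7, 2.0, 3.0, 0.6)
print(exponent_via_residual(*args), exponent_expanded(*args))
```

The two printed values agree to within rounding, whatever trial values are inserted, provided r lies between −1 and +1.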

4. If we assign to x₂ some fixed value, say h₂, we have the 
distribution of the array of x₁'s of type h₂, 

$$y_{12} = y_{12}'\,e^{-h_2^2/2\sigma_2^2}\; e^{-\left(x_1 - r_{12}\frac{\sigma_1}{\sigma_2}h_2\right)^2 / 2\sigma_{1.2}^2}$$

This is a normal distribution of standard-deviation σ₁.₂, with a 
mean deviating by r₁₂(σ₁/σ₂)h₂ from the mean of the whole 
distribution of x₁. As h₂ represents any value whatever of x₂, we see 
(1) that the standard-deviations of all arrays of x₁ are the same, 
and equal to σ₁.₂ : (2) that the regression of x₁ on x₂ is strictly 
linear. Similarly, of course, if we assign to x₁ any value h₁, we 
will find (1) that the standard-deviations of all arrays of x₂ are 
the same : (2) that the regression of x₂ on x₁ is strictly linear. 
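
The two properties of the arrays may be illustrated by a short sketch. It gives the mean and standard-deviation of the array of x₁'s for any assigned value h₂, and checks that the exponent of the surface then separates into a part depending on h₂ alone and a normal part in x₁. The constants are arbitrary illustrative values, not taken from the text.

```python
import math

def array_of_x1(h2, s1, s2, r):
    """Mean and standard-deviation of the array of x1's for x2 = h2
    (deviations measured from the means of the whole distribution)."""
    return r * s1 / s2 * h2, s1 * math.sqrt(1 - r * r)

def surface_exponent(x1, x2, s1, s2, r):
    """Exponent of the normal correlation surface, without the factor -1/2."""
    k = 1 - r * r
    return (x1**2 / (s1**2 * k) + x2**2 / (s2**2 * k)
            - 2 * r * x1 * x2 / (s1 * s2 * k))

# Arbitrary illustrative constants (not from the text)
s1, s2, r, h2, x1 = 2.0, 3.0, 0.6, 1.5, 0.4
m, s = array_of_x1(h2, s1, s2, r)
lhs = surface_exponent(x1, h2, s1, s2, r)
rhs = (x1 - m) ** 2 / s**2 + h2**2 / s2**2
print(m, s, lhs, rhs)
```

The standard-deviation returned does not involve h₂, and the mean is proportional to it, which are the constancy and linearity stated above.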



[Fig. 50. Contour lines of the normal correlation surface, referred to the 
axes of measurement. M, the mean of the whole surface, is also the summit 
of the surface ; RR, CC, lines of means.] 



5. The contour lines are, as in the case of independence, a 
series of concentric and similar ellipses ; the major and minor 
axes are, however, no longer parallel to the axes of x₁ and x₂ but 
make a certain angle with them. Fig. 50 illustrates the 
calculated form of the contour lines for one case, RR and CC being 
the lines of regression. As each line of regression cuts every 
array of x₁ or of x₂ in its mean, and as the distribution of every 
array is symmetrical about its mean, RR must bisect every 
horizontal chord and CC every vertical chord, as illustrated 
by the two chords shown by dotted lines : it also follows that 
RR cuts all the ellipses in the points of contact of the horizontal 
tangents to the ellipses, and CC in the points of contact of 
the vertical tangents. The surface or solid itself, somewhat 
truncated, is shown in fig. 29, p. 166. 

6. Since, as we see from fig. 50, a normal surface for two 
correlated variables may be regarded merely as a certain surface 
for which r is zero turned round through some angle, and since 
for every angle through which it is turned the distributions of all 
x₁ arrays and x₂ arrays are normal, it follows that every section 
of a normal surface by a vertical plane is a normal curve, i.e. the 
distributions of arrays taken at any angle across the surface are 
normal. It also follows that, since the total distributions of x₁ 
and x₂ must be normal for every angle through which the surface 
is turned, the distributions of totals given by slices or arrays 
taken at any angle across a normal surface must be normal 
distributions. But these would give the distributions of functions 
like a.x₁ ± b.x₂, and consequently (1) the distribution of any 
linear function of two normally distributed variables x₁ and x₂ 
must also be normal ; (2) the correlation between any two linear 
functions of two normally distributed variables must be normal 
correlation. 

To find the angle θ through which the surface has been turned, 
from the position for which the correlation is zero to the position 
for which the coefficient has some assigned value r, we must use 
a little trigonometry. The major and minor axes of the ellipses 
are sometimes termed the principal axes. If ξ₁, ξ₂ be the 
co-ordinates referred to the principal axes (the ξ₁-axis being the 
x₁ axis in its new position) we have for the relation between ξ₁, 
ξ₂, x₁, x₂, the angle θ being taken as positive for a rotation of 
the x₁-axis which will make it, if continued through 90°, coincide 
in direction and sense with the x₂-axis, 

$$\xi_1 = x_1\cos\theta + x_2\sin\theta, \qquad \xi_2 = x_2\cos\theta - x_1\sin\theta \qquad (8)$$

But, since ξ₁, ξ₂ are uncorrelated, Σ(ξ₁ξ₂) = 0. Hence, multiplying 
together equations (8) and summing, 

$$0 = (\sigma_2^2 - \sigma_1^2)\sin 2\theta + 2\,r_{12}\,\sigma_1\sigma_2\cos 2\theta,$$

$$\tan 2\theta = \frac{2\,r_{12}\,\sigma_1\sigma_2}{\sigma_1^2 - \sigma_2^2} \qquad (9)$$







It should be noticed that if we define the principal axes of any 
distribution for two variables as being a pair of axes at right 
angles for which the variables ξ₁, ξ₂ are uncorrelated, equation 
(9) gives the angle that they make with the axes of measurement 
whether the distribution be normal or no. 
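
Equation (9) may be applied arithmetically as in the following sketch. The constants are those quoted in § 10 for the table of statures of father and son (σ₁ = 2.72, σ₂ = 2.75, r₁₂ = 0.51). The use of `math.atan2` to select one of the roots of the equation is a choice made here, giving for positive r an angle between 0° and 90°; it is not prescribed by the text.

```python
import math

def principal_axis_angle(s1, s2, r):
    """Angle theta (radians) with tan(2 theta) = 2 r s1 s2 / (s1^2 - s2^2)."""
    return 0.5 * math.atan2(2 * r * s1 * s2, s1**2 - s2**2)

def product_moment_after_rotation(s1, s2, r, theta):
    """Mean product of the rotated co-ordinates xi1, xi2."""
    return (r * s1 * s2 * math.cos(2 * theta)
            + 0.5 * (s2**2 - s1**2) * math.sin(2 * theta))

# Constants quoted in section 10 for the father-and-son stature table
s1, s2, r = 2.72, 2.75, 0.51
theta = principal_axis_angle(s1, s2, r)
print(round(math.degrees(theta), 1))
print(abs(product_moment_after_rotation(s1, s2, r, theta)) < 1e-12)
```

As the two standard-deviations are nearly equal, the angle comes out close to 45°, and the product-moment of the rotated co-ordinates vanishes as it should.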

7. The two standard-deviations, say Σ₁ and Σ₂, about the 
principal axes are of some interest, for evidently from § 2 the 
major and minor axes of the contour-ellipses are proportional 
to these two standard-deviations. They may be most readily 
determined as follows. Squaring the two transformation equations 
(8), summing and adding, we have 

$$\Sigma_1^2 + \Sigma_2^2 = \sigma_1^2 + \sigma_2^2 \qquad (10)$$

Referring the surface to the axes of measurement, we have for 
the central ordinate by equation (7) 

$$y_{12}' = \frac{N}{2\pi\,\sigma_1\sigma_2(1-r_{12}^2)^{1/2}}$$

Referring it to the principal axes, by equation (3) 

$$y_{12}' = \frac{N}{2\pi\,\Sigma_1\Sigma_2}$$

But these two values of the central ordinate must be equal, 
therefore 

$$\Sigma_1\Sigma_2 = \sigma_1\sigma_2(1-r_{12}^2)^{1/2} \qquad (11)$$

(10) and (11) are a pair of simultaneous equations from which 
Σ₁ and Σ₂ may be very simply obtained in any arithmetical case. 
Care must, however, be taken to give the correct signs to the 
square root in solving. Σ₁ + Σ₂ is necessarily positive, and Σ₁ − Σ₂ 
also if r is positive, the major axes of the ellipses lying along ξ₁ : 
but if r be negative, Σ₁ − Σ₂ is also negative. It should be noted 
that, while we have deduced (11) from a simple consideration 
depending on the normality of the distribution, it is really of 
general application (like equation 10), and may be obtained at 
somewhat greater length from the equations for transforming 
co-ordinates. 
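
The solution of the simultaneous equations (10) and (11) may be sketched as follows, the sign of Σ₁ − Σ₂ being made to follow that of r as directed above. The function name is hypothetical, and the constants are those quoted in § 10 for the stature table.

```python
import math

def principal_standard_deviations(s1, s2, r):
    """Solve S1^2 + S2^2 = s1^2 + s2^2 and S1*S2 = s1*s2*sqrt(1 - r^2)."""
    total = s1**2 + s2**2
    product = s1 * s2 * math.sqrt(1 - r * r)
    plus = math.sqrt(total + 2 * product)                       # S1 + S2, always positive
    minus = math.copysign(math.sqrt(total - 2 * product), r)    # S1 - S2 takes the sign of r
    return (plus + minus) / 2, (plus - minus) / 2

big_s1, big_s2 = principal_standard_deviations(2.72, 2.75, 0.51)
print(round(big_s1, 3), round(big_s2, 3))
```

The greater of the two values is very nearly the 3.361 in. found in § 10 for arrays taken at 45°, as would be expected when the principal axis lies close to that diagonal.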

8. As stated in Chap. XV. § 13, the frequency-distribution 
for any variable may be expected to be approximately normal 
if that variable may be regarded as the sum (or, within limits, 
some slightly more complex function) of a large number of other 
variables, provided that these elementary component variables 
are independent, or nearly so. Similarly, the correlation between 
two variables may be expected to be approximately normal if 
each of the two variables may be regarded as the sum, or some 
slightly more complex function, of a large number of elementary 
component variables, the intensity of correlation depending on 
the proportion of the components common to the two variables. 

Stature is a highly compound character of this kind, and we 
have seen that, in one instance at least, the distribution of stature 
for a number of adults is given approximately by the normal 
curve. We can now utilise Table III., Chap. IX., p. 160, showing 
the correlation between stature of father and son, to test, as far 
as we can by elementary methods, whether the normal surface 
will fit the distribution of the same character in pairs of 
individuals : we leave it to the student to test, as far as he can do so 
by simple graphical methods, the approximate normality of the 
total distributions for this table. The first important property 
of the normal distribution is the linearity of the regression, and 
this was illustrated in fig. 37, p. 174. It is evident that the 
means of arrays deviate slightly here and there from the lines 
of regression, but there is no marked and regular departure from 
linearity, no suggestion of a smooth and sweeping curve. 
Subject to some investigation as to the possibility of the 
deviations that do occur arising as fluctuations of simple sampling, 
when drawing samples from a record for which the regression 
is strictly linear, we may conclude that the regression is 
appreciably linear. 

9. The second important property of the normal distribution 
for two variables is the constancy of the standard-deviation for 
all parallel arrays. We gave in Chap. X. p. 204 the 
standard-deviations of ten of the columns of the present table, from the 
column headed 62.5-63.5 onwards ; these were: 

2.56 2.60 

2.11 2.26 

2.55 2.26 

2.24 2.45 

2.23 2.33 

the mean being 2.36. The standard-deviations again only fluctuate 
irregularly round their mean value. The mean of the first five 
is 2.34, of the second five 2.38, a difference of only 0.04 : of the 
first group, two are greater and three are less than the mean, 
and the same is true of the second group. There does not seem 
to be any indication of a general tendency for the 
standard-deviation to increase or decrease as we pass from one end of the 
table to the other. We are not yet in a position to test how 
far the differences from the average standard-deviation might 
arise in sampling from a record in which the distribution was 
strictly normal, but, as a fact, a rough test suggests that they 
might have done so. 
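
The arithmetic of this comparison is easily repeated. In the sketch below it is assumed that the ten values are to be read down the first column and then down the second, an ordering which reproduces the two group means given in the text.

```python
from statistics import mean

# Standard-deviations of the ten columns, read down the first column
# and then down the second (an assumption about the printed order)
first_five = [2.56, 2.11, 2.55, 2.24, 2.23]
second_five = [2.60, 2.26, 2.26, 2.45, 2.33]
all_ten = first_five + second_five

overall = mean(all_ten)
print(round(overall, 2), round(mean(first_five), 2), round(mean(second_five), 2))

# How many in each group lie above the overall mean
print(sum(s > overall for s in first_five), sum(s > overall for s in second_five))
```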

10. Next we note that the distributions of all arrays of a 
normal surface should themselves be normal. Owing, however, 
to the small numbers of observations in any array, the distributions 
of arrays are very irregular, and their normality cannot be tested 
in any very satisfactory way : we can only say that they do not 
exhibit any marked or regular asymmetry. But we can test the 
allied property of a normal correlation-table, viz. that the totals 
of arrays must give a normal distribution even if the arrays be 
taken diagonally across the surface, and not parallel to either 
axis of measurement (cf. § 6). From an ordinary 
correlation-table we cannot find the totals of such diagonal arrays exactly, 
but the totals of arrays at an angle of 45° will be given with 
sufficient accuracy for our present purpose by the totals of lines 
of diagonally adjacent compartments. Referring again to Table 
III., Chap. IX., and forming the totals of such diagonals (running 
up from left to right), we find, starting at the top left-hand 
corner of the table, the following distribution : 



     0.25    78.75
     2       81.25
     3.25    67.5
     6.25    59.25
     8       42.25
             67.5
             86.75
             87.25

    Total 1078

The mean of this distribution is at 0.368 of an interval above the
centre of the interval with frequency 78; its standard-deviation
is 4.755 intervals, or, remembering that the interval is $1/\sqrt{2}$ of
an inch, 3.362 inches. (This value may be checked directly from
the constants for the table given in Chap. IX., Question 3, p. 189,
for we have from the first of the transformation equations (8),

$$\sigma_\xi^2 = \sigma_1^2\cos^2\theta + \sigma_2^2\sin^2\theta + 2r_{12}\sigma_1\sigma_2\sin\theta\cos\theta,$$

and inserting $\sigma_1 = 2.72$, $\sigma_2 = 2.75$, $r_{12} = 0.51$, $\sin\theta = \cos\theta = 1/\sqrt{2}$,
find $\sigma_\xi = 3.361$.) Drawing a diagram and fitting a normal
curve we have fig. 51; the distribution is rather irregular but the
fit is fair; certainly there is no marked asymmetry, and, so far as
the graphical test goes, the distribution may be regarded as
appreciably normal. One of the greatest divergences of the
actual distribution from the normal curve occurs in the almost
central interval with frequency 78; the difference between the
observed and calculated frequencies is here 12 units, but the
standard error is 9.1, so that it may well have occurred as a
fluctuation of simple sampling.
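The check on the standard-deviation of the diagonal totals can be repeated numerically. This is a minimal sketch using only the constants quoted in the text; the variable names are illustrative assumptions.

```python
import math

# Constants quoted in the text for the father-son stature table.
sigma_1, sigma_2, r_12 = 2.72, 2.75, 0.51
theta = math.pi / 4  # diagonal arrays taken at 45 degrees

# Variance along the rotated axis, from the transformation equation.
var_xi = (sigma_1 ** 2 * math.cos(theta) ** 2
          + sigma_2 ** 2 * math.sin(theta) ** 2
          + 2 * r_12 * sigma_1 * sigma_2 * math.sin(theta) * math.cos(theta))
sigma_xi = math.sqrt(var_xi)

# Direct value: 4.755 diagonal intervals, each 1/sqrt(2) of an inch.
sigma_direct = 4.755 / math.sqrt(2)

print(round(sigma_xi, 3), round(sigma_direct, 3))
```

The two values agree to within about a thousandth of an inch, which is the agreement the text relies on.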























































Fig. 51. Distribution of Frequency obtained by addition of Table III.,
Chap. IX., along Diagonals running up from left to right, fitted with a
Normal Curve.

11. So far, we have seen (1) that the regression is approximately
linear; (2) that, in the arrays which we have tested, the
standard-deviations are approximately constant, or at least that
their differences are only small, irregular and fluctuating; (3) that
the distribution of totals for one set of diagonal arrays is approximately
normal. These results suggest, though they cannot
completely prove, that the whole distribution of frequency may
be regarded as approximately normal, within the limits of fluctuations
of sampling. We may therefore apply a more searching
test, viz. the form of the contour lines and the closeness of their
fit to the contour-ellipses of the normal surface. We can see at
once, however, that no very close fit can be expected. Since the
frequencies in the compartments of the table are small, the
standard error of any frequency is given approximately by its
square root (Chap. XIII. § 12), and this implies a standard error
of about 5 units at the centre of the table, 3 units for a frequency
of 9, or 2 units for a frequency of 4: such fluctuations might
cause wide divergences in the corresponding contour lines.

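Using the suffix 1 to denote the constants relating to the
distribution of stature for fathers, and 2 the same constants for
the sons,

$$N = 1078 \qquad M_1 = 67.70 \qquad M_2 = 68.66$$
$$\sigma_1 = 2.72 \qquad \sigma_2 = 2.75 \qquad r_{12} = 0.51$$

Hence we have from equation (7)

$$y'_{12} = 26.7$$

and the complete expression for the fitted normal surface is

$$y_{12} = 26.7\,\exp\left\{-\frac{1}{2(1-0.51^2)}\left[\frac{x_1^2}{(2.72)^2} - \frac{2(0.51)\,x_1x_2}{(2.72)(2.75)} + \frac{x_2^2}{(2.75)^2}\right]\right\}.$$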
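The central ordinate can be recomputed from the constants. The sketch below assumes that equation (7), which is not reproduced in this section, has the usual form $y'_{12} = N / (2\pi\sigma_1\sigma_2\sqrt{1-r_{12}^2})$; the function name `fitted_frequency` is illustrative.

```python
import math

# Constants quoted in the text.
N = 1078
sigma_1, sigma_2, r_12 = 2.72, 2.75, 0.51

# Assumed form of equation (7): central ordinate of the normal surface.
y0 = N / (2 * math.pi * sigma_1 * sigma_2 * math.sqrt(1 - r_12 ** 2))


def fitted_frequency(x1, x2):
    """Fitted frequency per square inch at deviations (x1, x2) from the means."""
    q = (x1 ** 2 / sigma_1 ** 2
         - 2 * r_12 * x1 * x2 / (sigma_1 * sigma_2)
         + x2 ** 2 / sigma_2 ** 2) / (1 - r_12 ** 2)
    return y0 * math.exp(-0.5 * q)


print(round(y0, 1), round(math.sqrt(y0), 1))
```

Under this assumption the central ordinate comes out at the quoted 26.7, and its square root is a little over 5, in keeping with the standard error of about 5 units mentioned for the centre of the table.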



The equation to any contour ellipse will be given by equating
the index of $e$ to a constant, but it is very much easier to draw
the ellipses if we refer them to their principal axes. To do this
we must first determine $\theta$, $\Sigma_1$ and $\Sigma_2$. From (9),

$$\tan 2\theta = -46.49,$$

whence $2\theta = 91^\circ\,14'$, $\theta = 45^\circ\,37'$, the principal axes standing very
nearly at an angle of 45° with the axes of measurement,
owing to the two standard-deviations being very nearly equal.
They should be set off on the diagram, not with a protractor, but
by taking $\tan\theta$ from the tables (1.022) and calculating points on
each axis on either side of the mean.
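The angle of the principal axes can be recomputed as follows. This is a sketch only: equation (9) is not reproduced in this section, and the form $\tan 2\theta = 2r_{12}\sigma_1\sigma_2/(\sigma_1^2-\sigma_2^2)$ is assumed here because it reproduces the quoted value.

```python
import math

sigma_1, sigma_2, r_12 = 2.72, 2.75, 0.51  # constants quoted in the text

# Assumed form of equation (9).
tan_2theta = 2 * r_12 * sigma_1 * sigma_2 / (sigma_1 ** 2 - sigma_2 ** 2)

# Take 2*theta in the range 0-180 degrees, as the text does.
two_theta_deg = math.degrees(math.atan(tan_2theta)) % 180
theta_deg = two_theta_deg / 2
tan_theta = math.tan(math.radians(theta_deg))


def degrees_minutes(angle):
    """Split a positive angle in degrees into whole degrees and rounded minutes."""
    whole = int(angle)
    return whole, round((angle - whole) * 60)


print(round(tan_2theta, 2), degrees_minutes(two_theta_deg),
      degrees_minutes(theta_deg), round(tan_theta, 3))
```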

To obtain $\Sigma_1$ and $\Sigma_2$ we have from (10) and (11)

$$\Sigma_1^2 + \Sigma_2^2 = 14.961$$

$$2\,\Sigma_1\Sigma_2 = 12.868$$

Adding and subtracting these equations from each other and
taking the square root,

$$\Sigma_1 + \Sigma_2 = 5.275$$

$$\Sigma_1 - \Sigma_2 = 1.447$$

whence $\Sigma_1 = 3.36$, $\Sigma_2 = 1.91$; owing to the principal axes standing
nearly at 45° the first value is sensibly the same as that found
for $\sigma_\xi$ in § 10. The equations to the contour ellipses, referred to
the principal axes, may therefore be written in the form

$$\frac{\xi_1^2}{(3.36)^2} + \frac{\xi_2^2}{(1.91)^2} = c^2,$$

the major and minor axes being $3.36 \times c$ and $1.91 \times c$ respectively.
To find $c$ for any assigned value of the frequency $y$ we have

$$c^2 = \frac{2(\log y'_{12} - \log y_{12})}{\log e}.$$

Supposing that we desire to draw the three contour-ellipses for
$y = 5$, 10 and 20, we find $c = 1.83$, 1.40 and 0.76, or the following
values for the major and minor axes of the ellipses: semi-major
axes, 6.15, 4.70, 2.55; semi-minor axes, 3.50, 2.67, 1.45.

Fig. 52. Contour Lines for the Frequencies 5, 10 and 20 of the distribution
(horizontal axis: Stature of Father, inches).

The ellipses drawn with these axes are shown in fig. 52, very much
reduced, of course, from the original drawing, one of the squares
shown representing a square inch on the original. The actual
contour lines for the same frequencies are shown by the irregular
polygons superposed on the ellipses, the points on these polygons
having been obtained by simple graphical interpolation between
the frequencies in each row and each column, diagonal interpolation
between the frequencies in a row and the frequencies in a
column not being used. It will be seen that the fit of the two
lower contours is, on the whole, fair, especially considering the
high standard errors. In the case of the central contour, $y = 20$,
the fit looks very poor to the eye, but if the ellipse be compared
carefully with the table, the figures suggest that here again we
have only to deal with the effects of fluctuations of sampling.
For father's stature = 66 in., son's stature = 70 in., there is
a frequency of 18.75, and an increase in this much less than the
standard error would bring the actual contour outside the ellipse.
Again, for father's stature = 68 in., son's stature = 71 in., there
is a frequency of 19, and an increase of a single unit would give
a point on the actual contour below the ellipse. Taking the
results as a whole, the fit must be regarded as quite as good as
we could expect with such small frequencies. It is perhaps of
historical interest to note that Sir Francis Galton, working without
a knowledge of the theory of normal correlation, suggested
that the contour lines of a similar table for the inheritance of
stature seemed to be closely represented by a series of concentric
and similar ellipses (ref. 2): the suggestion was confirmed when
he handed the problem, in abstract terms, to a mathematician,
Mr J. D. Hamilton Dickson (ref. 4), asking him to investigate
"the Surface of Frequency of Error that would result from
these data, and the various shapes and other particulars of its
sections that were made by horizontal planes" (ref. 3, p. 102).
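The constants of the contour-ellipses in § 11 can be recomputed from the quoted data. The sketch below assumes that equations (10) and (11), which are not reproduced here, amount to $\Sigma_1^2+\Sigma_2^2 = \sigma_1^2+\sigma_2^2$ and $\Sigma_1\Sigma_2 = \sigma_1\sigma_2\sqrt{1-r_{12}^2}$; these forms reproduce the quoted figures 14.961 and 12.868. All names are illustrative.

```python
import math

sigma_1, sigma_2, r_12 = 2.72, 2.75, 0.51  # constants quoted in the text
y_centre = 26.7                            # central ordinate quoted in the text

# Assumed forms of equations (10) and (11).
sum_of_squares = sigma_1 ** 2 + sigma_2 ** 2
twice_product = 2 * sigma_1 * sigma_2 * math.sqrt(1 - r_12 ** 2)

s_plus = math.sqrt(sum_of_squares + twice_product)   # Sigma_1 + Sigma_2
s_minus = math.sqrt(sum_of_squares - twice_product)  # Sigma_1 - Sigma_2
Sigma_1 = (s_plus + s_minus) / 2
Sigma_2 = (s_plus - s_minus) / 2


def ellipse_constants(y):
    """Return (c, semi-major axis, semi-minor axis) for the contour of frequency y."""
    c = math.sqrt(2 * math.log(y_centre / y))
    return c, Sigma_1 * c, Sigma_2 * c


for y in (5, 10, 20):
    c, major, minor = ellipse_constants(y)
    print(y, round(c, 2), round(major, 2), round(minor, 2))
```

Small differences in the last figure from the values in the text are to be expected, since the text multiplies rounded values of $c$ by rounded values of $\Sigma_1$ and $\Sigma_2$.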

12. The normal distribution of frequency for two variables is
an isotropic distribution, to which all the theorems of Chap. V.
§§ 11-12 apply. For if we isolate the four compartments of the
correlation-table common to the rows and columns centring
round values of the variables $x_1$, $x_2$, $x'_1$, $x'_2$, we have for the ratio
of the cross-products (frequency of $x_1 x_2$ multiplied by frequency
of $x'_1 x'_2$, divided by frequency of $x'_1 x_2$ multiplied by frequency of
$x_1 x'_2$)

$$\exp\left\{\frac{r_{12}}{\sigma_1\sigma_2(1-r_{12}^2)}\,(x'_1 - x_1)(x'_2 - x_2)\right\}.$$

Assuming that $x'_1 - x_1$ has been taken of the same sign as $x'_2 - x_2$,
the exponent is of the same sign as $r_{12}$. Hence the association for
this group of four frequencies is also of the same sign as $r_{12}$, the
ratio of the cross-products being unity, or the association zero,
if $r_{12}$ is zero. In a normal distribution, the association is therefore
of the same sign (the sign of $r_{12}$) for every tetrad of frequencies
in the compartments common to two rows and two columns; that
is to say, the distribution is isotropic. It follows that every
grouping of a normal distribution is isotropic whether the
class-intervals are equal or unequal, large or small, and the sign of the
association for a normal distribution grouped down to 2 x 2-fold
form must always be the same whatever the axes of division
chosen.
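The behaviour of the cross-product ratio is easily checked numerically. The following is a sketch under the ordinary form of the normal surface; the function names and the particular tetrad are illustrative assumptions, and the constants are those of the stature table.

```python
import math


def normal_surface(x1, x2, s1=2.72, s2=2.75, r=0.51):
    """Unnormalised bivariate normal frequency at deviations (x1, x2)."""
    q = (x1 ** 2 / s1 ** 2 - 2 * r * x1 * x2 / (s1 * s2) + x2 ** 2 / s2 ** 2) / (1 - r ** 2)
    return math.exp(-0.5 * q)


def cross_ratio(x1, x1p, x2, x2p, r=0.51):
    """f(x1,x2) f(x1',x2') divided by f(x1',x2) f(x1,x2')."""
    f = lambda a, b: normal_surface(a, b, r=r)
    return f(x1, x2) * f(x1p, x2p) / (f(x1p, x2) * f(x1, x2p))


def closed_form(x1, x1p, x2, x2p, s1=2.72, s2=2.75, r=0.51):
    """The same ratio from the single exponential given in the text."""
    return math.exp(r * (x1p - x1) * (x2p - x2) / (s1 * s2 * (1 - r ** 2)))


# An arbitrary tetrad with x1' > x1 and x2' > x2.
tetrad = (-1.0, 2.0, 0.5, 3.0)
for r in (0.51, 0.0, -0.51):
    print(r, round(cross_ratio(*tetrad, r=r), 4), round(closed_form(*tetrad, r=r), 4))
```

The ratio exceeds unity for positive $r_{12}$, equals unity for $r_{12} = 0$, and falls below unity for negative $r_{12}$, whatever tetrad is chosen.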

These theorems are of importance in the applications of the
theory of normal correlation to the treatment of qualitative
characters which are subjected to a manifold classification. The
contingency tables for such characters are sometimes regarded as
groupings of a normal distribution of frequency, and the coefficient
of correlation is determined on this hypothesis by a rather lengthy
procedure (ref. 14). Before applying this procedure it is well,
therefore, to see whether the distribution of frequency may be
regarded as approximately isotropic, or reducible to isotropic form
by some alteration in the order of rows and columns (Chap. V.
§§ 9-10). If only reducible to isotropic form by some rearrangement,
this rearrangement should be effected before grouping the
table to 2 x 2-fold form for the calculation of the correlation
coefficient by the process referred to. If the table is not reducible
to isotropic form by any rearrangement, the process of calculating
the coefficient of correlation on the assumption of normality is to
be avoided. Clearly, even if the table be isotropic it need not be
normal, but at least the test for isotropy affords a rapid and
simple means for excluding certain distributions which are not
even remotely normal. Table II. of Chap. V. might possibly be
regarded as a grouping of normally distributed frequency if
rearranged as suggested in § 10 of the same chapter (it would be
worth the investigator's while to proceed further and compare
the actual distribution with a fitted normal distribution), but
Table IV. could not be regarded as normal, and could not be
rearranged so as to give a grouping of normally distributed
frequency.

13. If the frequencies in a contingency-table be not large, and
also if the contingency or correlation be small, the influence
of casual irregularities due to fluctuations of sampling may
render it difficult to say whether the distribution may be regarded
as essentially isotropic or no. In such cases some further
condensation of the table by grouping together adjacent rows and
columns, or some process of "smoothing" by averaging the
frequencies in adjacent compartments, may be of service. The
correlation-table for stature in father and son (Table III., Chap.
IX.), for instance, is obviously not strictly isotropic as it stands:
we have seen, however, that it appears to be normal, within the
limits of fluctuations of sampling, and it should consequently be
isotropic within such limits. We can apply a rough test by
regrouping the table in a much coarser form, say with four rows
and four columns; the table below exhibits such a grouping, the
limits of rows and of columns having been so fixed as to include
not less than 200 observations in each array.

Table I. (condensed from Table III. of Chapter IX.).

    Son's Stature                     Father's Stature (inches).
    (inches).         Under 65.5   65.5-67.5   67.5-69.5   69.5 and over    Total

    Under 66.5           97.5        74.25       34.75         10.5         217
    66.5-68.5            76.5       108          85            52           321.5
    68.5-70.5            33.25       64.75       95            84.5         277.5
    70.5 and over        14.75       32.5        80.75        134           262

    Total               222         279.5       295.5         281          1078


Taking the ratio of the frequency in col. 1 to the sum of the
frequencies in cols. 1 and 2 for each successive row, and so on for
the other pairs of columns, we find the following series of ratios:

                        Columns
    Row.     1 and 2.   2 and 3.   3 and 4.

    1         0.568      0.681      0.768
    2         0.415      0.560      0.620
    3         0.339      0.405      0.529
    4         0.312      0.287      0.376


These ratios decrease continuously as we pass from the top to the
bottom of the table, and the distribution, as condensed, is therefore
isotropic. The student should form one or two other condensations
of the original table to 3 x 3- or 4 x 4-fold form: he will probably
find them either isotropic, or diverging so slightly from isotropy
that an alteration of the frequencies, well within the margin of
possible fluctuations of sampling, will render the distribution
isotropic.
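The test of isotropy applied to the condensed table may be sketched as follows. The frequencies are those of Table I., read so that every row and column adds up to its printed total; the function name is an illustrative assumption.

```python
# Condensed Table I: rows are son's stature classes, columns are father's.
table = [
    [97.5, 74.25, 34.75, 10.5],
    [76.5, 108.0, 85.0, 52.0],
    [33.25, 64.75, 95.0, 84.5],
    [14.75, 32.5, 80.75, 134.0],
]


def column_pair_ratios(rows, j):
    """Ratio of column j to the sum of columns j and j+1, for each row."""
    return [row[j] / (row[j] + row[j + 1]) for row in rows]


ratios = [column_pair_ratios(table, j) for j in range(3)]

# The table is isotropic if every series of ratios falls steadily down the rows.
isotropic = all(
    all(upper > lower for upper, lower in zip(series, series[1:]))
    for series in ratios
)

for series in ratios:
    print([round(value, 3) for value in series])
print(isotropic)
```

The same function can be applied to any other condensation of the original table by changing the list of frequencies.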

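14. Before concluding this chapter we may note briefly some
of the principal properties of the normal distribution of frequency
for any number of variables, referring the student for proofs to
the original memoirs. Denoting the frequency of the combination
of deviations $x_1, x_2, x_3, \ldots, x_n$ by $y_{12 \ldots n}$, we must have
in the notation of Chapter XII., if the uncorrelated deviations $x_1$,
$x_{2.1}$, $x_{3.12}$, etc. be completely independent (cf. § 3 of the present
chapter),

$$y_{12 \ldots n} = y'_{12 \ldots n}\, e^{-\frac{1}{2}\phi(x_1, x_2, \ldots, x_n)} \qquad (12)$$

$$\phi(x_1, x_2, \ldots, x_n) = \frac{x_1^2}{\sigma_1^2} + \frac{x_{2.1}^2}{\sigma_{2.1}^2} + \frac{x_{3.12}^2}{\sigma_{3.12}^2} + \cdots + \frac{x_{n.12 \ldots (n-1)}^2}{\sigma_{n.12 \ldots (n-1)}^2} \qquad (13)$$

where

$$y'_{12 \ldots n} = \frac{N}{(2\pi)^{n/2}\,\sigma_1\sigma_{2.1}\sigma_{3.12}\cdots\sigma_{n.12 \ldots (n-1)}} \qquad (14)$$

The expression (13) for the exponent $\phi$ may be reduced to a
general form corresponding to that given for two variables, viz.

$$\phi = \frac{x_1^2}{\sigma_{1.23 \ldots n}^2} + \frac{x_2^2}{\sigma_{2.13 \ldots n}^2} + \cdots + \frac{x_n^2}{\sigma_{n.12 \ldots (n-1)}^2} - 2r_{12.3 \ldots n}\frac{x_1x_2}{\sigma_{1.23 \ldots n}\,\sigma_{2.13 \ldots n}} - \cdots \qquad (15)$$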
Several important results may be deduced directly from the form
(13) for the exponent. Clearly this might have been written in
a great variety of ways, commencing with any deviation of the
first order, allotting any primary subscript to the second deviation
(except the subscript of the first), and so on, just as in § 3 we
arrived at precisely the same final form for the exponent whether
we started with the two deviations $x_1$ and $x_{2.1}$ or with $x_2$ and $x_{1.2}$.
Our assumption, then, that the deviations $x_1$, $x_{2.1}$, $x_{3.12}$, etc., are
normally distributed amounts to the assumption that all deviations
of any order and with any suffixes are normally distributed,
i.e. in the general normal distribution for n variables every array
of every order is a normal distribution. It will also follow,
generalising the deduction of § 6, that any linear function of $x_1, x_2,
\ldots, x_n$ is normally distributed. Further, if in (13) any fixed
values be assigned to $x_{3.12}$ and all the following deviations, the
correlation between $x_1$ and $x_2$ on expanding $x_{2.1}$ is, as we have
seen, normal correlation. Similarly, if any fixed values be
assigned to $x_1$, to $x_{4.123}$ and all the following deviations, on
reducing $x_{3.12}$ to the second order we shall find that the correlation
between $x_{2.1}$ and $x_{3.1}$ is normal correlation, the correlation
coefficient being $r_{23.1}$, and so on. That is to say, using $k$ to
denote any group of secondary suffixes, (1) the correlation between
any two deviations $x_{m.k}$ and $x_{n.k}$ is normal correlation; (2) the
correlation between the said deviations is $r_{mn.k}$ whatever the particular
fixed values assigned to the remaining deviations. The latter
conclusion, it will be seen, renders the meaning of partial
correlation coefficients much more definite in the case of normal
correlation than in the general case. In the general case $r_{mn.k}$
represents merely the average correlation, so to speak, between
$x_{m.k}$ and $x_{n.k}$; in the normal case $r_{mn.k}$ is constant for all the
sub-groups corresponding to particular assigned values of the other
variables. Thus in the case of three variables which are normally
correlated, if we assign any given value to $x_3$, the correlation
between the associated values of $x_1$ and $x_2$ is $r_{12.3}$: in the general
case $r_{12.3}$, if actually worked out for the various sub-groups
corresponding, say, to increasing values of $x_3$, would probably
exhibit some continuous change, increasing or decreasing as the
case might be. Finally, we have to note that if, in the expression
(15) for $\phi$ we assign fixed values, say $h_2$, $h_3$, etc., to all the
deviations except $x_1$, and then throw $\phi$ into the form of a perfect
square (as in § 4 for the case of two variables), we obtain a normal
distribution for $x_1$ in which the mean is displaced by

$$r_{12.34 \ldots n}\frac{\sigma_{1.34 \ldots n}}{\sigma_{2.34 \ldots n}}\,h_2 + r_{13.24 \ldots n}\frac{\sigma_{1.24 \ldots n}}{\sigma_{3.24 \ldots n}}\,h_3 + \cdots$$

But this is a linear function of $h_2$, $h_3$, etc., therefore in the case of
normal correlation the regression of any one variable on any or all
of the others is strictly linear. The expressions $r_{12.34 \ldots n}\,\sigma_{1.34 \ldots n}/\sigma_{2.34 \ldots n}$,
$r_{13.24 \ldots n}\,\sigma_{1.24 \ldots n}/\sigma_{3.24 \ldots n}$, etc. are of course the partial regressions.
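For three variables these statements can be verified numerically from the coefficients of the exponent. The sketch below inverts an illustrative covariance matrix (the standard-deviations and correlations are hypothetical values chosen for the example, not data from the text) and compares the coefficients of the quadratic form with the usual formulae for $r_{12.3}$ and the partial regression of $x_1$ on $x_2$.

```python
import math


def inverse_3x3(m):
    """Inverse of a 3 x 3 matrix by cofactors."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adjugate = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[value / det for value in row] for row in adjugate]


# Hypothetical standard-deviations and total correlations (illustration only).
s = [2.0, 3.0, 1.5]
r12, r13, r23 = 0.6, 0.5, 0.4
corr = [[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]]
cov = [[corr[i][j] * s[i] * s[j] for j in range(3)] for i in range(3)]

# Coefficients of the quadratic form in the exponent of the normal surface.
P = inverse_3x3(cov)

# Partial correlation: from the usual formula, and from the exponent.
r12_3_formula = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
r12_3_from_phi = -P[0][1] / math.sqrt(P[0][0] * P[1][1])

# Partial regression of x1 on x2: r12.3 * sigma_1.3 / sigma_2.3, and from the exponent.
sigma_1_3 = s[0] * math.sqrt(1 - r13 ** 2)
sigma_2_3 = s[1] * math.sqrt(1 - r23 ** 2)
b12_3_formula = r12_3_formula * sigma_1_3 / sigma_2_3
b12_3_from_phi = -P[0][1] / P[0][0]

print(round(r12_3_formula, 5), round(r12_3_from_phi, 5))
print(round(b12_3_formula, 5), round(b12_3_from_phi, 5))
```

Neither coefficient involves the value assigned to $x_3$, which is the sense in which the partial correlation is constant for all sub-groups and the regression strictly linear.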

REFERENCES.

General.

(1) Bravais, A., "Analyse mathématique sur les probabilités des erreurs de
situation d'un point," Acad. des Sciences: Mémoires présentés par divers
savants, II^e série, ix., 1846, p. 255.

(2) Galton, Francis, "Family Likeness in Stature," Proc. Roy. Soc., vol. xl.,
1886, p. 42.

(3) Galton, Francis, Natural Inheritance; Macmillan & Co., 1889.

(4) Dickson, J. D. Hamilton, Appendix to (2), Proc. Roy. Soc., vol. xl.,
1886, p. 63.

(5) Edgeworth, F. Y., "On Correlated Averages," Phil. Mag., 5th Series,
vol. xxxiv., 1892, p. 190.

(6) Pearson, Karl, "Regression, Heredity, and Panmixia," Phil. Trans.
Roy. Soc., Series A, vol. clxxxvii., 1896, p. 253.

(7) Pearson, Karl, "On Lines and Planes of Closest Fit to Systems of Points
in Space," Phil. Mag., 6th Series, vol. ii., 1901, p. 559. (On the fitting
of "principal axes" and the corresponding planes in the case of more
than two variables.)

(8) Pearson, Karl, "On the Influence of Natural Selection on the Variability
and Correlation of Organs," Phil. Trans. Roy. Soc., Series A, vol. cc.,
1902, p. 1. (Based on the assumption of normal correlation.)

(9) Pearson, Karl, and Alice Lee, "On the Generalised Probable Error in
Multiple Normal Correlation," Biometrika, vol. vi., 1908, p. 59.

(10) Yule, G. U., "On the Theory of Correlation," Jour. Roy. Stat. Soc.,
vol. lx., 1897, p. 812.

(11) Yule, G. U., "On the Theory of Correlation for any number of Variables
treated by a New System of Notation," Proc. Roy. Soc., Series A, vol.
lxxix., 1907, p. 182.

(12) Sheppard, W. F., "On the Application of the Theory of Error to Cases
of Normal Distribution and Normal Correlation," Phil. Trans. Roy.
Soc., Series A, vol. cxcii., 1898, p. 101.

(13) Sheppard, W. F., "On the Calculation of the Double-integral
expressing Normal Correlation," Cambridge Phil. Trans., vol. xix.,
1900, p. 23.

Applications to the Theory of Attributes, etc.

(14) Pearson, Karl, "On the Correlation of Characters not Quantitatively
Measurable," Phil. Trans. Roy. Soc., Series A, vol. cxcv., 1900, p. 1.

(15) Pearson, Karl, "On a New Method of Determining Correlation between
a Measured Character A and a Character B, of which only the Percentage
of Cases wherein B exceeds (or falls short of) a given Intensity is
recorded for each grade of A," Biometrika, vol. vii., 1909, p. 96.

(16) Pearson, Karl, "On a New Method of Determining Correlation, when
one Variable is given by Alternative and the other by Multiple
Categories," Biometrika, vol. vii., 1910, p. 248.

See also the memoir (12) by Sheppard.

Various Methods and their Relation to Normal Correlation.

(17) Pearson, Karl, "On the Theory of Contingency and its Relation to
Association and Normal Correlation," Drapers' Company Research
Memoirs, Biometric Series I.; Dulau & Co., London, 1904.

(18) Pearson, Karl, "On Further Methods of Determining Correlation,"
Drapers' Company Research Memoirs, Biometric Series IV. (Methods
based on correlation of ranks: difference methods.) Dulau & Co.,
London, 1907.

(19) Spearman, C., "A Footrule for Measuring Correlation," Brit. Jour. of
Psychology, vol. ii., 1906, p. 89. (The suggestion of a "rank" method:
see Pearson's criticism and improved formula in (18) and Spearman's
reply on some points in (20).)

(20) Spearman, C., "Correlation calculated from Faulty Data," Brit. Jour.
of Psychology, vol. iii., 1910, p. 271.

(21) Thorndike, E. L., "Empirical Studies in the Theory of Measurement,"
Archives of Psychology (New York), 1907.



EXERCISES.

1. Deduce equation (11) from the equations for transformation of co-ordinates
without assuming the normal distribution. (A proof will be found in ref. 10.)

2. Hence show that if the pairs of observed values of $x_1$ and $x_2$ are
represented by points on a plane, and a straight line drawn through the mean, the
sum of the squares of the distances of the points from this line is a minimum
if the line is the major principal axis.

3. The coefficient of correlation with reference to the principal axes being
zero, and with reference to other axes something, there must be some pair of
axes at right angles for which the correlation is a maximum, i.e. is numerically
greatest without regard to sign. Show that these axes make an angle of 45°
with the principal axes, and that the maximum value of the correlation is

$$\frac{\Sigma_1^2 - \Sigma_2^2}{\Sigma_1^2 + \Sigma_2^2}.$$

4. (Sheppard, ref. 12.) A fourfold table is formed from a normal correlation
table, taking the points of division between $A$ and $\alpha$, $B$ and $\beta$, at the
medians, so that $(A) = (\alpha) = (B) = (\beta) = N/2$. Show that

$$r = \cos\left\{\left(1 - \frac{2(AB)}{N}\right)\pi\right\}.$$



CHAPTER XVII.

THE SIMPLER CASES OF SAMPLING FOR VARIABLES:
PERCENTILES AND MEAN.

1-2. The problem of sampling for variables; the conditions assumed.
3. Standard error of a percentile. 4. Special values for the percentiles
of a normal distribution. 5. Effect of the form of the distribution
generally. 6. Simplified formula for the case of a grouped
frequency-distribution. 7. Correlation between errors in two percentiles of the
same distribution. 8. Standard error of the interquartile range
for the normal curve. 9. Effect of removing the restrictions of simple
sampling, and limitations of interpretation. 10. Standard error of
the arithmetic mean. 11. Relative stability of mean and median in
sampling. 12. Standard error of the difference between two means.
13. The tendency to normality of a distribution of means. 14. Effect
of removing the restrictions of simple sampling. 15. Statement of the
standard errors of standard deviation, coefficient of variation,
correlation coefficient and regression. 16. Restatement of the limitations
of interpretation if the sample be small.

1. In Chapters XIII.-XVI. we have been concerned solely with
the theory of sampling for the case of attributes and the
frequency-distributions appropriate to that case. We now proceed to
consider some of the simpler theorems for the case of variables
(cf. Chap. XIII. § 3). Suppose that we have a bag containing a
practically infinite number of tickets or cards bearing the recorded
values of some variable $X$, and that we draw a ticket from this
bag, note the value that it bears, draw another, and so on until
we have drawn $n$ cards (a number small compared with the whole
number in the bag). Let us continue this process until we have
$N$ such samples of $n$ cards each, and then work out the mean,
standard-deviation, median, etc., for each of the samples. No one
of these measures will prove to be absolutely the same for every
sample, and our problem is to determine the standard-deviation
that each such measure will exhibit.

2. In solving this problem, we must be careful to define
precisely the conditions which are assumed to subsist, so as to
realise the limitations of any solution obtained. These conditions
were discussed very fully for the case of attributes (Chap. XIII.
§ 8), and we would refer the student to the discussion then given.
Here it is sufficient to state the assumptions briefly, using the
letters (a), (b) and (c) to denote the corresponding assumptions
indicated by the same letters in the section cited.

(a) We assume that we are drawing from precisely the same
record throughout the experiment, so that the chance of drawing
a card with any given value of $X$, or a value within any assigned
limits, is the same at each sampling.

(b) We assume not only that we are drawing from the same
record throughout, but that each of our cards at each drawing
may be regarded quite strictly as drawn from the same record (or
from identically similar records): e.g. if our card-record is
contained in a series of bundles, we must not make it a practice to
take the first card from bundle number 1, the second card from
bundle number 2, and so on, or else the chance of drawing a
card with a given value of $X$, or a value within assigned limits,
may not be the same for each individual card at each drawing.

(c) We assume that the drawing of each card is entirely
independent of that of every other, so that the value of $X$ recorded
on card 1, at each drawing, is uncorrelated with the value of $X$
recorded on card 2, 3, 4, and so on. It is for this reason that we
spoke of the record, in § 1, as containing a practically infinite
number of cards, for otherwise the successive drawings at each
sampling would not be independent: if the bag contain ten
tickets only, bearing the numbers 1 to 10, and we draw the card
bearing 1, the average of the following cards drawn will be higher
than the mean of all cards drawn; if, on the other hand, we draw
the 10, the average of the following cards will be lower than the mean
of all cards, i.e. there will be a negative correlation between the
number on the card taken at any one drawing and the card taken
at any other drawing. Without making the number of cards in
the bag indefinitely large, we can, as already pointed out for the
case of attributes (Chap. XIII. § 3), eliminate this correlation by
replacing each card before drawing the next.
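The negative correlation described under (c) can be found exactly for the bag of ten tickets by enumerating every equally likely pair of successive drawings. This is a minimal sketch; the function name `correlation` is an illustrative assumption.

```python
from itertools import permutations, product
from statistics import mean

tickets = range(1, 11)  # ten tickets bearing the numbers 1 to 10


def correlation(pairs):
    """Product-moment correlation over a list of equally likely (first, second) pairs."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mx, my = mean(xs), mean(ys)
    cov = mean((a - mx) * (b - my) for a, b in pairs)
    var_x = mean((a - mx) ** 2 for a in xs)
    var_y = mean((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5


# Without replacement: the second ticket can never equal the first.
r_without = correlation(list(permutations(tickets, 2)))

# With replacement: every ordered pair is possible.
r_with = correlation(list(product(tickets, repeat=2)))

print(round(r_without, 4), round(r_with, 4))
```

Without replacement the correlation is -1/9, and in general -1/(M - 1) for a bag of M tickets, so that it vanishes as the bag becomes indefinitely large; with replacement it is zero.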

Sampling conducted under these conditions we shall, as before,
speak of as simple sampling. We do not, it should be noticed,
make the further assumption that the sample is unbiassed, i.e.
that the chance of inclusion in the sample is independent of the
value of $X$ recorded on the card (cf. the last paragraph in § 8,
Chap. XIII., and the discussion in §§ 4-8, Chap. XIV.). This
assumption is unnecessary. If it be true, the interpretation of
our results becomes simpler and more straightforward, for we
can substitute for such phrases as "the standard-deviation of $X$
in a very large sample," "the form of the frequency-distribution
in a very large sample," the phrases "the standard-deviation of
$X$ in the original record," "the form of the frequency-distribution
in the original record": but in very many, perhaps the majority
of, practical cases the very question at issue is the nature of the
relation between the distribution of the sample and the distribution
of the record from which it is drawn. As has already been
emphasised in the passages to which reference is made above, no
examination of samples drawn under the same conditions can
give any evidence on this head.

3. Standard Error of a Percentile. Let us consider first the
fluctuations of sampling for a given percentile, as the problem is
intimately related to that of Chaps. XIII.-XIV.

Let $X_p$ be a value of $X$ such that $pN$ of the values of $X$ in
an indefinitely large sample drawn under the same conditions lie
above it and $qN$ below it.

If we note the proportions of observations above $X_p$ in samples
of $n$ drawn from the record, we know that these observed values
will tend to centre round $p$ as mean, with a standard-deviation
$\sqrt{pq/n}$. If now at each drawing, as well as observing the
proportion of $X$'s above $X_p$, say $p + \delta$, for the sample, we also proceed
to note the adjustment $\epsilon$ required in $X_p$ to make the proportion
of observations above $X_p + \epsilon$ in the sample $p$, the
standard-deviation of $\epsilon$ will bear to the standard-deviation of $\delta$ the same
ratio that $\epsilon$ on an average bears to $\delta$. But this ratio is quite
simply determinable if the number of observations in the sample
is sufficiently large to justify us in assuming that $\delta$ is small, so
small that we may regard the element of the frequency curve
(for a very large sample) over which $X_p + \epsilon$ ranges as approximately
a rectangle. If this assumption be made, and we denote the
standard-deviation of $X$ in a very large sample by $\sigma$, and the
ordinate of the frequency curve at $X_p$ when drawn with unit area
and unit standard-deviation by $y_p$,

$$\epsilon = \frac{\sigma}{y_p}\,\delta.$$

Therefore for the standard-deviation of $\epsilon$ or of the percentile
corresponding to a proportion $p$ we have

$$\sigma_p = \frac{\sigma}{y_p}\sqrt{\frac{pq}{n}} \qquad (1)$$
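Equation (1) lends itself to a direct experimental check of the kind described in § 1. The following sketch draws repeated samples from a normal record with unit standard-deviation and compares the observed standard-deviation of the sample medians with the value given by (1) for $p = q = 0.5$; the sample size, number of samples and random seed are arbitrary choices made for the illustration.

```python
import math
import random
import statistics

random.seed(12345)   # fixed seed so that the experiment is repeatable
n = 101              # cards in each sample
trials = 2000        # number of samples drawn

# Median of each sample drawn from a normal record with sigma = 1.
medians = [
    statistics.median(random.gauss(0.0, 1.0) for _ in range(n))
    for _ in range(trials)
]
observed = statistics.pstdev(medians)

# Equation (1) with p = q = 0.5 and y_p the central ordinate of the unit normal curve.
p = q = 0.5
y_p = 1 / math.sqrt(2 * math.pi)
predicted = (1.0 / y_p) * math.sqrt(p * q / n)

print(round(observed, 4), round(predicted, 4))
```

The two figures should agree within a few per cent, the remaining difference being itself a fluctuation of sampling in the experiment.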

4. If the frequency-distribution for the very large sample be a
normal curve, the values of $y_p$ for the principal percentiles may be
taken from the published tables. A table calculated by Mr
Sheppard (Table IV., ref. 14, in Appendix I.), gives the values
directly, and these have been utilised for the following: the
student can estimate the values roughly by a combined use of the
area and ordinate tables for the normal curve given in Chapter
XV., remembering to divide the ordinates given in that table by
$\sqrt{2\pi}$ so as to make the area unity:

                            Value of $y_p$.
    Median                    0.3989423
    Deciles 4 and 6           0.3863425
       "    3 and 7           0.3476926
       "    2 and 8           0.2799619
       "    1 and 9           0.1754983
    Quartiles                 0.3177766

Inserting these values of $y_p$ in equation (1), we have the
following values for the standard errors of the median, deciles,
etc., and the values given in the second column for their probable
errors (Chap. XV. § 17), which the student may sometimes find
useful:

                         Standard error is       Probable error is
                         $\sigma/\sqrt{n}$ multiplied by    $\sigma/\sqrt{n}$ multiplied by
    Median                   1.25331                 0.84535
    Deciles 4 and 6          1.26804                 0.85528
       "    3 and 7          1.31800                 0.88897
       "    2 and 8          1.42877                 0.96369
       "    1 and 9          1.70942                 1.15298
    Quartiles                1.36263                 0.91908

It will be seen that the influence of fluctuations of sampling on
the several percentiles increases as we depart from the median:
the standard error of the quartiles is nearly one-tenth greater than
that of the median, and the standard error of the first or ninth
deciles more than one-third greater.
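The entries of both tables can be recomputed without reference to printed tables. This is a minimal sketch using the normal-distribution class of the Python standard library; the function name `percentile_row` is an illustrative assumption.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # unit area, unit standard-deviation

# The probable error is the standard error multiplied by the quartile deviate.
PROBABLE_ERROR_FACTOR = std_normal.inv_cdf(0.75)


def percentile_row(p):
    """Return (y_p, standard-error multiplier, probable-error multiplier)."""
    y_p = std_normal.pdf(std_normal.inv_cdf(p))
    standard_error = math.sqrt(p * (1 - p)) / y_p
    return y_p, standard_error, PROBABLE_ERROR_FACTOR * standard_error


rows = {
    "Median": 0.5,
    "Deciles 4 and 6": 0.6,
    "Deciles 3 and 7": 0.7,
    "Deciles 2 and 8": 0.8,
    "Deciles 1 and 9": 0.9,
    "Quartiles": 0.75,
}
for label, p in rows.items():
    y_p, se, pe = percentile_row(p)
    print(f"{label:<16} {y_p:.7f} {se:.5f} {pe:.5f}")
```

The multipliers apply to $\sigma/\sqrt{n}$, so that for a sample of 400 observations, for example, the standard error of the median would be about 0.0627 of the standard-deviation.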

5. Consider further the influence of the form of the
frequency-distribution on the standard error of the median, as this is an
important form of average. For a distribution with a given
number of observations and a given standard-deviation the
standard error varies inversely as $y_p$. Hence for a distribution in
which $y_p$ is small, for example a U-shaped distribution like that
of fig. 18 or fig. 19, the standard error of the median will be
relatively high, and it will, in so far, be an undesirable form of
average to employ. On the other hand, in the case of a distribution
which has a high peak in the centre, so as to exhibit a value
of $y_p$ large compared with the standard-deviation, the standard
error of the median will be relatively low. We can create such a
"peaked" distribution by superposing a normal curve with a
small standard-deviation on a normal curve with the same mean
and a relatively large standard-deviation. To give some idea of
the reduction in the standard error of the median that may be
effected by a moderate change in the form of the distribution, let
us find for what ratio of the standard-deviations of two such curves,
having the same area, the standard error of the median reduces to
$\sigma/\sqrt{n}$, where $\sigma$ is of course the standard-deviation of the
compound distribution.
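Let σ₁, σ₂ be the standard-deviations of the two distributions,
and let there be n/2 observations in each. Then

$$\sigma = \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{2}} \qquad (a)$$

On the other hand, the value of y_p is

$$y_p = \frac{\sigma}{2\sqrt{2\pi}}\left(\frac{1}{\sigma_1} + \frac{1}{\sigma_2}\right) \qquad (b)$$

Hence the standard error of the median is

$$\frac{\sigma}{2\,y_p\sqrt{n}} = \frac{\sqrt{2\pi}\,\sigma_1\sigma_2}{(\sigma_1 + \sigma_2)\sqrt{n}} \qquad (c)$$

(c) is equal to σ/√n if

$$\frac{(\sigma_1 + \sigma_2)\sqrt{\sigma_1^2 + \sigma_2^2}}{2\sqrt{\pi}\,\sigma_1\sigma_2} = 1,$$

that is, writing σ₂/σ₁ = ρ, if

$$\frac{(1 + \rho)\sqrt{1 + \rho^2}}{2\sqrt{\pi}\,\rho} = 1,$$

or

$$\rho^4 + 2\rho^3 + (2 - 4\pi)\rho^2 + 2\rho + 1 = 0.$$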

This equation may be reduced to a quadratic and solved by
taking ρ + 1/ρ as a new variable. The roots found give ρ = 2.2360
.... or 0.4472 ...., the one root being merely the reciprocal of
the other. The standard error of the median will therefore be
σ/√n, in such a compound distribution, if the standard-deviation
of the one normal curve is, in round numbers, about 2¼ times
that of the other. If the ratio be greater, the standard error
of the median will be less than σ/√n. The distribution
for which the standard error of the median is exactly equal to
σ/√n is shown in fig. 53: it will be seen that it is by no means
a very striking form of distribution; at a hasty glance it might
almost be taken as normal. In the case of distributions of a form
more or less similar to that shown, it is evident that we cannot
at all safely estimate by eye alone the relative standard error of
the median as compared with σ/√n.
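
The reduction to a quadratic can be followed numerically. This is a minimal sketch, assuming the compound standard-deviation is given by σ² = (σ₁² + σ₂²)/2 and the standard error of the median by √(2π)σ₁σ₂/((σ₁ + σ₂)√n), as follows from the definitions in the text; the variable and function names are illustrative only.

```python
from math import pi, sqrt

# Dividing the quartic by rho**2 and writing u = rho + 1/rho gives
# u**2 + 2*u - 4*pi = 0; only the positive root is admissible.
u = -1.0 + sqrt(1.0 + 4.0 * pi)

# rho + 1/rho = u is the quadratic rho**2 - u*rho + 1 = 0.
rho_large = (u + sqrt(u * u - 4.0)) / 2.0
rho_small = (u - sqrt(u * u - 4.0)) / 2.0

def median_se_ratio(rho):
    """Standard error of the median divided by sigma/sqrt(n), for two
    superposed normal curves of equal area whose standard-deviations
    stand in the ratio rho."""
    sigma1, sigma2 = 1.0, rho
    sigma = sqrt((sigma1 ** 2 + sigma2 ** 2) / 2.0)
    se_times_root_n = sqrt(2.0 * pi) * sigma1 * sigma2 / (sigma1 + sigma2)
    return se_times_root_n / sigma

print(rho_large, rho_small, median_se_ratio(rho_large))
```

With ρ = 1 the function returns the ordinary normal value 1.2533, and for ratios beyond about 2¼ it falls below unity, in agreement with the statement above.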

6. In the case of a grouped frequency-distribution, if the
number of observations is sufficient to make the class-frequencies
run fairly smoothly, i.e. to enable us to regard the distribution
as nearly that of a very large sample, the standard error of any
percentile can be calculated very readily indeed, for we can
eliminate σ from equation (1). Let f_p be the frequency-per-
class-interval at the given percentile; simple interpolation will
give us the value with quite sufficient accuracy for practical
purposes, and if the figures run irregularly they may be smoothed.
Let σ be the value of the standard-deviation expressed in class-
intervals, and let n be the number of observations as before.
Then since y_p is the ordinate of the frequency-distribution when
drawn with unit standard-deviation and unit area, we must
have

$$y_p = \frac{\sigma}{n}\,f_p$$


But this gives at once for the standard error expressed in terms
of the class-interval as unit

$$\text{standard error of the percentile} = \frac{\sqrt{npq}}{f_p} \qquad (2)$$

As an example in which we can compare the results given by
the two different formulae (1) and (2), take the distribution of
stature used as an illustration in Chaps. VII. and VIII. and in
§§ 13, 14 of Chap. XV. The number of observations is 8585,
and the standard-deviation 2.57 in., the distribution being
approximately normal: σ/√n = 0.027737, and, multiplying by the
factor 1.253 .... given in the table in § 4, this gives 0.0348
as the standard error of the median, on the assumption of
normality of the distribution. Using the direct method of
equation (2), we find the median to be 67.47 (Chap. VII. § 15),
which is very nearly at the centre of the interval with a
frequency 1329. Taking this as being, with sufficient accuracy
for our present purpose, the frequency per interval at the median,
the standard error is

$$\frac{\sqrt{8585 \times \tfrac{1}{4}}}{1329} = 0.0349.$$

As we should expect, the value is practically the same as that
obtained from the value of the standard-deviation on the
assumption of normality.

Let us find the standard error of the first and ninth deciles
as another illustration. On the assumption that the distribution
is normal, these standard errors are the same, and equal to
0.027737 × 1.70942 = 0.0474. Using the direct method, we
find by simple interpolation the approximate frequencies per
interval at the first and ninth deciles respectively to be 590 and
570, giving standard errors of 0.0471 and 0.0488, mean 0.0479,
slightly in excess of that found on the assumption that the
frequency is given by the normal curve. The student should notice
that the class-interval is, in this case, identical with the unit of
measurement, and consequently the answer given by equation (2)
does not require to be multiplied by the magnitude of the
interval.

In the case of the distribution of pauperism (Chap. VII.,
Example i.), the fact that the class-interval is not a unit must
be remembered. The frequency at the median (3.195 per cent.)
is approximately 96, and this gives for the standard error of the
median by (2) (the number of observations being 632) 0.1309
intervals, that is 0.0655 per cent.
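
The arithmetic of these examples may be reproduced as follows. This is a sketch only, taking equation (2) in the form √(npq)/f_p; the function name and the `interval` parameter are illustrative, and the class-interval of 0.5 per cent. for the pauperism data is inferred from the conversion of 0.1309 intervals to 0.0655 per cent. stated in the text.

```python
from math import sqrt

def percentile_se_grouped(n, p, f_p, interval=1.0):
    """Standard error of a percentile from a grouped distribution:
    sqrt(n*p*q)/f_p class-intervals, converted to the unit of measurement
    by multiplying by the width of the class-interval."""
    q = 1.0 - p
    return sqrt(n * p * q) / f_p * interval

# Stature: 8585 observations, class-interval of 1 inch.
stature_median = percentile_se_grouped(8585, 0.5, 1329)
stature_decile1 = percentile_se_grouped(8585, 0.1, 590)
stature_decile9 = percentile_se_grouped(8585, 0.9, 570)

# Pauperism: 632 observations, class-interval assumed to be 0.5 per cent.
pauperism_median = percentile_se_grouped(632, 0.5, 96, interval=0.5)

print(stature_median, stature_decile1, stature_decile9, pauperism_median)
```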

7. In finding the standard error of the difference between two
percentiles in the same distribution, the student must be careful
to note that the errors in two such percentiles are not
independent. Consider the two percentiles, for which the values
of p and q are p₁, q₁, p₂, q₂ respectively, the first-named being the
lower of the two percentiles. These two percentiles divide the
whole area of the frequency curve into three parts, the areas of
which are proportional to q₁, 1 − q₁ − p₂, and p₂. Further, since
the errors in the first percentile are directly proportional to the
errors in q₁, and the errors in the second percentile are directly
proportional but of opposite sign to the errors in p₂, the
correlation between errors in the two percentiles will be the same as
the correlation between errors in q₁ and p₂ but of opposite sign.
But if there be a deficiency of observations below the lower
percentile, producing an error δ₁ in q₁, the missing observations
will tend to be spread over the two other sections of the curve
in proportion to their respective areas, and will therefore tend to
produce an error

$$\delta_2 = -\frac{p_2}{p_1}\,\delta_1$$

in p₂. If then r be the correlation between errors in q₁ and p₂,
ε₁ and ε₂ their respective standard errors, we have

$$r\,\frac{\epsilon_2}{\epsilon_1} = -\frac{p_2}{p_1}.$$

Or, inserting the values of the standard errors,

$$r = -\sqrt{\frac{q_1 p_2}{p_1 q_2}}.$$

The correlation between the percentiles is the same in magnitude
but opposite in sign: it is obviously positive, and consequently

$$\text{correlation between errors in two percentiles} = \sqrt{\frac{q_1 p_2}{p_1 q_2}} \qquad (3)$$

If the two percentiles approach very close together, q₁ and q₂,
p₁ and p₂ become sensibly equal to one another, and the
correlation becomes unity, as we should expect.

8. Let us apply the above value of the correlation between
percentiles to find the standard error of the semi-interquartile
range for the normal curve. Inserting q₁ = p₂ = ¼, p₁ = q₂ = ¾, we
find r = ⅓. Hence the standard error of the interquartile range
is, applying the ordinary formula for the standard-deviation of a
difference, 2/√3 times the standard error of either quartile, or
the standard error of the semi-interquartile range 1/√3 times
the standard error of a quartile. Taking the value of the
standard error of a quartile from the table in § 4, we have, finally,

$$\text{standard error of the semi-interquartile range in a normal distribution} = 0.78672\,\frac{\sigma}{\sqrt{n}} \qquad (4)$$

Of course the standard-deviation of the inter-quartile, or semi-
interquartile, range can readily be worked out in any particular
case, using equation (2) and the value of the correlation
given above: it is best to work out such standard errors
from first principles, applying the usual formula for the standard
deviation of the difference of two correlated variables (Chap. XI.
§ 2, equation (1)).
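
The steps of §§ 7 and 8 can be sketched in a few lines. The sketch assumes the correlation between errors in two percentiles to be √(q₁p₂/(p₁q₂)), and the quartile standard error to be √(pq)/y_p in units of σ/√n for a normal distribution; the function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def percentile_error_correlation(q1, p2):
    """Correlation between errors in two percentiles, q1 being the
    proportion below the lower percentile and p2 the proportion above
    the upper percentile."""
    p1, q2 = 1.0 - q1, 1.0 - p2
    return sqrt((q1 * p2) / (p1 * q2))

def se_of_difference(se_a, se_b, r):
    """Standard-deviation of the difference of two correlated variables."""
    return sqrt(se_a ** 2 + se_b ** 2 - 2.0 * r * se_a * se_b)

unit = NormalDist()
# Standard error of a quartile in units of sigma/sqrt(n).
quartile_se = sqrt(0.25 * 0.75) / unit.pdf(unit.inv_cdf(0.25))

r_quartiles = percentile_error_correlation(0.25, 0.25)
semi_iqr_se = se_of_difference(quartile_se, quartile_se, r_quartiles) / 2.0

print(r_quartiles, semi_iqr_se)
```

The same two functions serve for any pair of percentiles in a grouped distribution, the standard errors then being taken from equation (2).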

9. If there is any failure of the conditions of simple sampling,
the formulae of the preceding sections cease, of course, to hold
good. We need not, however, enter again into a discussion of
the effect of removing the several restrictions, for the effect on
the standard error of p was considered in detail in §§ 9-14 of
Chap. XIV., and the standard error of any percentile is directly
proportional to the standard error of p (cf. § 3). Further, the
student may be reminded that the standard error of any
percentile measures solely the fluctuations that may be expected in
that percentile owing to the errors of simple sampling alone; it
has no bearing, therefore, save on the one question, whether an
observed divergence of the percentile, from a certain value that
might be expected to be yielded by a more extended series of
observations or that had actually been observed in some other
series, might or might not be due to fluctuations of simple
sampling alone. It cannot and does not give any indication of
the possibility of the sample being biassed or unrepresentative of
the material from which it has been drawn, nor can it give any
indication of the magnitude or influence of definite errors of
observation, errors which may conceivably be of greater
importance than errors of sampling. In the case of the distribution
of statures, for instance, the standard error almost certainly gives
quite a misleading idea as to the accuracy attained in determining
the average stature for the United Kingdom: the sample is not
representative, the several parts of the kingdom not contributing
in their true proportions. The student should refer again to the
discussion of these points in §§ 4-8 of Chap. XIV. Finally, we
may note that the standard error of a percentile cannot be
evaluated unless the number of observations is fairly large, large
enough to determine f_p (eqn. 2) with reasonable accuracy, or
to test whether we may treat the distribution as approximately
normal (cf. also § 16 below).

(As regards the theory of sampling for the median and
percentiles generally, cf. ref. 12, Laplace, Supplement II. (standard
error of the median), Edgeworth, refs. 4, 5, 6, and Sheppard, ref.
21: the preceding sections have been based on the work of
Edgeworth and Sheppard.)

10. Standard Error of the Arithmetic Mean. Let us now pass
to a fresh problem, and determine the standard error of the
arithmetic mean.

This is very readily obtained. Suppose we note separately at
each drawing the value recorded on the first, second, third ....
and nth card of our sample. The standard-deviation of the values
on each separate card will tend in the long run to be the same
and identical with the standard-deviation σ of x in an indefinitely
large sample, drawn under the same conditions. Further, the
value recorded on each card is (as we assume) uncorrelated with
that on every other. The standard-deviation of the sum of the
values recorded on the n cards is therefore √n·σ, and the
standard-deviation of the mean of the sample is consequently
1/nth of this; or,

$$\sigma_m = \frac{\sigma}{\sqrt{n}} \qquad (5)$$

This is a most important and frequently cited formula, and the
student should note that it has been obtained without any
reference to the size of the sample or to the form of the frequency-
distribution. It is therefore of perfectly general application, if
σ be known. We can verify it against our formula for the
standard-deviation of sampling in the case of attributes. The
standard-deviation of the number of successes in a sample of m
observations is √(mpq): the standard-deviation of the total
number of successes in n samples of m observations each is
therefore √(nm·pq): dividing by n we have the standard-deviation of
the mean number of successes in the n samples, viz. √(mpq)/√n,
agreeing with equation (5).
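
That the result does not depend on the form of the distribution can be confirmed by exact enumeration for a small case. The sketch below uses a hypothetical skew record of four card-values with assumed chances of drawing each; it enumerates every possible sample of n independent drawings and compares the standard-deviation of the sample mean with σ/√n.

```python
from itertools import product
from math import sqrt

values = [0, 1, 2, 5]            # hypothetical values recorded on the cards
probs = [0.4, 0.3, 0.2, 0.1]     # hypothetical chance of drawing each value
n = 3                            # number of cards in each sample

mean = sum(v * p for v, p in zip(values, probs))
sigma = sqrt(sum(p * (v - mean) ** 2 for v, p in zip(values, probs)))

# Enumerate every ordered sample of n independent drawings and accumulate
# the mean-square deviation of the sample mean from the true mean.
var_of_mean = 0.0
for draw in product(range(len(values)), repeat=n):
    prob = 1.0
    total = 0.0
    for i in draw:
        prob *= probs[i]
        total += values[i]
    var_of_mean += prob * (total / n - mean) ** 2

se_enumerated = sqrt(var_of_mean)
se_formula = sigma / sqrt(n)     # sigma / sqrt(n), as in equation (5)
print(se_enumerated, se_formula)
```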

11. For a normal curve the standard error of the mean is to
the standard error of the median approximately as 100 to 125
(cf. § 4), and in general the standard errors of the two stand in
a somewhat similar ratio for a distribution not differing largely
from the normal form. For the distribution of statures used as
an illustration in § 6 the standard error of the median was found
to be 0.0349: the standard error of the mean is only 0.0277.
The distribution being very approximately normal, the ratio of
the two standard errors, viz. 1.26, assumes almost exactly the
theoretical magnitude. In the case of the asymmetrical distribution of
rates of pauperism, also used as an illustration in § 6, the standard
error of the median was found to be 0.0655 per cent. The
standard error of the mean is only 0.0493 per cent., which bears
to the standard error of the median a ratio of 1 to 1.33. As
such cases as these seem on the whole to be the more common
and typical, we stated in Chap. VII. § 18 that the mean is in
general less affected than the median by errors of sampling. At
the same time we also indicated the exceptional cases in which
the median might be the more stable: cases in which the mean
might, for example, be affected considerably by small groups of
widely outlying observations, or in which the frequency-
distribution assumed a form resembling fig. 53, but even more
exaggerated as regards the height of the central "peak" and the
relative length of the "tails." Such distributions are not
uncommon in some economic statistics, and they might be expected
to characterise some forms of experimental error. If, in these
cases, the greater stability of the median is sufficiently marked
to outweigh its disadvantages in other respects, the median
may be the better form of average to use. Fig. 53 represents
a distribution in which the standard errors of the mean and of the
median are the same. Further, in some experimental cases it is
conceivable that the median may be less affected by definite
experimental errors, the average of which does not tend to be
zero, than is the mean; this is, of course, a point quite distinct
from that of errors of sampling.

12. If two quite independent samples of n₁ and n₂ observations
respectively be drawn from a record, evidently ε₁₂, the standard
error of the difference of their means, is given by

$$\epsilon_{12}^2 = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \qquad (6)$$

If an observed difference exceed three times the value of ε₁₂
given by this formula it can hardly be ascribed to fluctuations
of sampling. If, in a practical case, the value of σ is not known
a priori, we must substitute an observed value, and it would seem
natural to take as this value the standard-deviation in the two
samples thrown together. If, however, the standard-deviations
of the two samples themselves differ more than can be accounted
for on the basis of fluctuations of sampling alone (see below, § 15),
we evidently cannot assume that both samples have been drawn
from the same record: the one sample must have been drawn
from a record or a universe exhibiting a greater standard-deviation
than the other. If two samples be drawn quite independently
from different universes, indefinitely large samples from which
exhibit the standard-deviations σ₁ and σ₂, the standard error of
the difference of their means will be given by

$$\epsilon_{12}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \qquad (7)$$

This is, indeed, the formula usually employed for testing the
significance of the difference between two means in any case:
seeing that the standard error of the mean depends on the
standard-deviation only, and not on the mean, of the distribution,
we can inquire whether the two universes from which samples
have been drawn differ in mean apart from any difference in
dispersion.

If two quite independent samples be drawn from the same
universe, but instead of comparing the mean of the one with the
mean of the other we compare the mean m₁ of the first with the
mean m₀ of both samples together, the use of (6) or (7) is not
justified, for errors in the mean of the one sample are correlated
with errors in the mean of the two together. Following precisely
the lines of the similar problem in § 13, Chap. XIII., case III., we
find that this correlation is √{n₁/(n₁ + n₂)}, and hence

$$\epsilon_{01}^2 = \sigma^2\,\frac{n_2}{n_1(n_1 + n_2)} \qquad (8)$$

(For a complete treatment of this problem in the case of samples
drawn from two different universes cf. ref. 19.)
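
The three cases of this section may be set side by side. The sketch assumes the forms σ²(1/n₁ + 1/n₂) and σ₁²/n₁ + σ₂²/n₂ for the squared standard errors in the first two cases, and obtains the third from the correlation √{n₁/(n₁ + n₂)} by the usual formula for the difference of two correlated variables; all function names are illustrative.

```python
from math import sqrt

def se_diff_same_universe(sigma, n1, n2):
    """Two independent samples from the same record."""
    return sigma * sqrt(1.0 / n1 + 1.0 / n2)

def se_diff_two_universes(sigma1, n1, sigma2, n2):
    """Two independent samples from universes of different dispersion."""
    return sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)

def se_diff_from_pooled_mean(sigma, n1, n2):
    """Mean of the first sample compared with the mean of both together;
    the two means are correlated to the extent sqrt(n1/(n1+n2))."""
    se1 = sigma / sqrt(n1)
    se0 = sigma / sqrt(n1 + n2)
    r = sqrt(n1 / (n1 + n2))
    return sqrt(se1 ** 2 + se0 ** 2 - 2.0 * r * se1 * se0)

def exceeds_three_standard_errors(difference, se):
    """Rough test: can the difference hardly be ascribed to sampling?"""
    return abs(difference) > 3.0 * se

print(se_diff_same_universe(2.0, 100, 400),
      se_diff_from_pooled_mean(2.0, 100, 400))
```

The third function reduces algebraically to σ√{n₂/(n₁(n₁ + n₂))}, which is always less than the value for two independent samples.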

13. The distribution of means of samples drawn under the
conditions of simple sampling will always be more symmetrical
than the distribution of the original record, and the symmetry
will be the greater the greater the number of observations in the
sample. Further, the distribution of means (and therefore also of
the differences between means) tends to become not merely
symmetrical but normal. We can only illustrate, not prove, the
point here; but if the student will refer to § 13, Chap. XV., he will
see that the genesis of the normal curve in this case is in
accordance with what we then stated, viz. that the distribution tends to
be normal whenever the variable may be regarded as the sum
(or some slightly more complex function) of a number of other
variables. In the present instance this condition is strictly
fulfilled. The mean of the sample of n observations is the sum of
the values in the sample each divided by n, and we should expect
the distribution to be the more nearly normal the larger n. As
an illustration of the approach to symmetry even for small values
of n, we may take the following case. If the student will turn to
the calculated binomials, given as illustrations of the forms of
binomial distributions in Chap. XV. § 3, he will find there the
distribution of the number of successes for twenty events when
q = 0.9, p = 0.1: the distribution is extremely skew, starting at
zero, rising to high frequencies for 1 and 2 successes, and thence
tailing off to 20 cases of 7 successes in 10,000 throws, 4 cases of 8
successes and 1 case of 9 successes. But now find the distribution
for the mean number of successes in groups of five throws,
under the same conditions. This will be equivalent to finding
the distribution of the number of successes for 100 such events,
and then dividing the observed number of successes by five, the
last process making no difference to the form of the distribution,
but only to its scale. But the distribution of the number of
successes for 100 events when q = 0.9, p = 0.1, is also given in
Chap. XV. § 3, and it will be seen that, while it is appreciably
asymmetrical, the divergence from symmetry is comparatively
small: the distribution has gained very greatly in symmetry
though only five observations have been taken to the sample.
We may therefore reasonably assume, if our sample is large,
that the distribution of means is approximately a normal
distribution, and we may calculate, on that assumption, the
frequency with which any given deviation from a theoretical value
or a value observed in some other series, in an observed mean, will
arise from fluctuations of simple sampling alone.
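
The binomial illustration can be recomputed directly. In the sketch below the asymmetry is measured by the third moment about the mean divided by the cube of the standard-deviation; this particular measure is an assumption made for the purpose of the example and is not necessarily the measure of skewness used elsewhere in the text.

```python
from math import comb

def binomial_pmf(n, p):
    """Chances of 0, 1, ..., n successes in n independent events."""
    q = 1.0 - p
    return [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]

def skewness(pmf):
    """Third moment about the mean divided by the cube of the s.d."""
    mean = sum(k * f for k, f in enumerate(pmf))
    mu2 = sum((k - mean) ** 2 * f for k, f in enumerate(pmf))
    mu3 = sum((k - mean) ** 3 * f for k, f in enumerate(pmf))
    return mu3 / mu2 ** 1.5

single = binomial_pmf(20, 0.1)       # successes in twenty events
grouped = binomial_pmf(100, 0.1)     # total successes in a group of five throws

# Expected numbers of throws in 10,000 giving 7, 8 and 9 successes.
tail_per_10000 = [round(10000 * single[k]) for k in (7, 8, 9)]
skew_single = skewness(single)
skew_grouped = skewness(grouped)
print(tail_per_10000, skew_single, skew_grouped)
```

On this measure the asymmetry of the distribution for groups of five throws is less than half that of the original distribution.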

The warning is necessary, however, that the approach to
normality is only rapid if the condition that the several drawings
for each sample shall be independent is strictly fulfilled. If the
observations are not independent, but are to some extent positively
correlated with each other, even a fairly large sample may
continue to reflect any asymmetry existing in the original distribution
(cf. ref. 24 and the record of sampling there cited).

If the original distribution be normal, the distribution of
means, even of small samples, is strictly normal. This follows at
once from the fact that any linear function of normally distributed
variables is itself normally distributed (Chap. XVI. § 6). The
distribution will not in general, however, be normal if the
deviation of the mean of each sample is expressed in terms of the
standard-deviation of that sample (cf. ref. 23).

14. Let us consider briefly the effect on the standard error of
the mean if the conditions of simple sampling as laid down in
§ 2 cease to apply.

(a) If we do not draw from the same record all the time, but
first draw a series of samples from one record, then another
series from another record with a somewhat different mean and
standard-deviation, and so on, or if we draw the successive
samples from essentially different parts of the same record, the
standard error will be greatly increased. For suppose we draw
k₁ samples from the first record, for which the standard-deviation
(in an indefinitely large sample) is σ₁, and the mean differs by
d₁ from the mean of all the records together (as ascertained by
large samples in numbers proportionate to those now taken); k₂
samples from the second record, for which the standard-deviation
is σ₂, and the mean differs by d₂ from the mean of all the records
together, and so on. Then for the samples drawn from the first
record the standard error of the mean will be σ₁/√n, but the
distribution will centre round a value differing by d₁ from the
mean for all the records together: and so on for the samples
drawn from the other records. Hence, if σ_m be the standard error
of the mean, N the total number of samples,

$$N\sigma_m^2 = \sum\left(k\,\frac{\sigma^2}{n}\right) + \sum(k d^2).$$

But the standard-deviation σ₀ for all the records together is given
by

$$N\sigma_0^2 = \sum(k\sigma^2) + \sum(k d^2).$$

Hence, writing Σ(kd²) = N·s_m²,

$$\sigma_m^2 = \frac{\sigma_0^2}{n} + \frac{n-1}{n}\,s_m^2 \qquad (9)$$

This equation corresponds precisely to equation (2) of § 9, Chap.
XIV. The standard error of the mean, if our samples are drawn
from different records or from essentially different parts of the
entire record, may be increased indefinitely as compared with the
value it would have in the case of simple sampling. If, for
example, we take the statures of samples of n men in a number
of different districts of England, and the standard-deviation of all
the statures observed is σ₀, the standard-deviation of the means
for the different districts will not be σ₀/√n, but will have some
greater value, dependent on the real variation in mean stature
from district to district.
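
A small numerical case shows how the increase arises. The sketch below assumes the relation σ_m² = σ₀²/n + (n − 1)s_m²/n, with N·s_m² = Σ(kd²) and N·σ₀² = Σ(kσ²) + Σ(kd²); the figures for the two records are hypothetical and the function name is illustrative.

```python
def se_mean_squared_mixed_records(n, k, sigma, d):
    """Squared standard error of the mean when k[i] samples of n are drawn
    from record i, whose standard-deviation is sigma[i] and whose mean
    differs by d[i] from the mean of all the records together."""
    N = sum(k)
    # Direct statement: sampling variance within each record plus the
    # squared displacement of that record's mean.
    direct = sum(ki * (si ** 2 / n + di ** 2)
                 for ki, si, di in zip(k, sigma, d)) / N
    sigma0_sq = sum(ki * (si ** 2 + di ** 2)
                    for ki, si, di in zip(k, sigma, d)) / N
    s_m_sq = sum(ki * di ** 2 for ki, di in zip(k, d)) / N
    via_formula = sigma0_sq / n + (n - 1) / n * s_m_sq
    simple_sampling = sigma0_sq / n
    return direct, via_formula, simple_sampling

# Hypothetical case: two records, ten samples of 25 from each.
direct, via_formula, simple = se_mean_squared_mixed_records(
    n=25, k=[10, 10], sigma=[2.0, 3.0], d=[1.0, -1.0])
print(direct, via_formula, simple)
```

In this case the squared standard error is 1.26 against 0.30 for simple sampling, the difference of means between the records contributing nearly the whole.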

(b) If we are drawing from the same record throughout, but
always draw the first card from one part of that record, the
second card from another part, and so on, and these parts differ
more or less, the standard error of the mean will be decreased.
For if, in large samples drawn from the subsidiary parts of the
record from which the several cards are taken, the standard-
deviations are σ₁, σ₂, .... σₙ, and the means differ by d₁, d₂,
.... dₙ from the mean for a large sample from the entire record,

$$\sigma_m^2 = \frac{1}{n^2}\sum(\sigma^2) = \frac{\sigma_0^2}{n} - \frac{s_m^2}{n} \qquad (10)$$

where s_m² = Σ(d²)/n, and σ₀ is, as before, the standard-deviation
for the entire record.

The last equation again corresponds precisely with that given for
the same departure from the rules of simple sampling in the case
of attributes (Chap. XIV. § 11, eqn. 4). If, to vary our previous
illustration, we had measured the statures of men in each of n
different districts, and then proceeded to form a set of samples
by taking one man from each district for the first sample, one
man from each district for the second sample, and so on, the
standard-deviation of the means of the samples so formed would
be appreciably less than the standard error of simple sampling
σ₀/√n. As a limiting case, it is evident that if the men in each
district were all of precisely the same stature, the means of all the
samples so compounded would be identical: in such a case, in fact,
σ₀ = s_m, and consequently σ_m = 0. To give another illustration, if
the cards from which we were drawing samples had been arranged
in order of the magnitude of X recorded on each, we would get
a much more stable sample by drawing one card from each
successive nth part of the record than by taking the sample
according to our previous rules, e.g. shaking them up in a bag
and taking out cards blindfold, or using some equivalent process.

The result is perhaps of some practical interest. It shows that,
if we are actually taking samples from a large area, different
districts of which exhibit markedly different means for the
variable under consideration, and are limited to a sample of n
observations; if we break up the whole area into m sub-districts,
each as homogeneous as possible, and take a contribution to the
sample from each, we will obtain a more stable mean by this
orderly procedure than will be given, for the same number of
observations, by any process of selecting the districts from which
samples shall be taken by chance. There may, however, be a
greater risk of biassed error. The conclusions seem in accord
with common-sense.
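
The gain from drawing one card from each part of the record can be illustrated in the same way. The sketch assumes σ_m² = Σ(σ²)/n² = σ₀²/n − s_m²/n with s_m² = Σ(d²)/n; the four parts and their figures are hypothetical, and the function name is illustrative.

```python
def se_mean_squared_one_card_per_part(sigma, d):
    """Squared standard error of the mean when one card is drawn from each
    of n parts of the record; sigma[i] is the standard-deviation within
    part i and d[i] the deviation of its mean from the general mean."""
    n = len(sigma)
    direct = sum(s ** 2 for s in sigma) / n ** 2
    sigma0_sq = sum(s ** 2 + di ** 2 for s, di in zip(sigma, d)) / n
    s_m_sq = sum(di ** 2 for di in d) / n
    via_formula = sigma0_sq / n - s_m_sq / n
    simple_sampling = sigma0_sq / n
    return direct, via_formula, simple_sampling

# Hypothetical record divided into four parts with widely different means.
direct, via_formula, simple = se_mean_squared_one_card_per_part(
    sigma=[1.0, 1.5, 2.0, 1.5], d=[-3.0, -1.0, 1.0, 3.0])
print(direct, via_formula, simple)

# Limiting case: every part perfectly homogeneous.
limiting = se_mean_squared_one_card_per_part([0.0, 0.0], [2.0, -2.0])
```

Here the squared standard error falls from 1.84375 under simple sampling to 0.59375, and in the limiting case of perfectly homogeneous parts it vanishes.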

(c) Finally, suppose that, while our conditions (a) and (b) of § 2
hold good, the magnitude of the variable recorded on one card
drawn is no longer independent of the magnitude recorded on
another card, e.g. that if the first card drawn at any sampling
bears a high value, the next and following cards of the same
sample are likely to bear high values also. Under these
circumstances, if r₁₂ denote the correlation between the values on the
first and second cards, and so on,

$$\sigma_m^2 = \frac{\sigma^2}{n} + \frac{2\sigma^2}{n^2}\left(r_{12} + r_{13} + \ldots + r_{23} + \ldots\right).$$

There are n(n − 1)/2 correlations; and if, therefore, r is the
arithmetic mean of them all, we may write

$$\sigma_m^2 = \frac{\sigma^2}{n}\left[1 + r(n - 1)\right] \qquad (11)$$

As the means and standard-deviations of x₁, x₂, .... xₙ are all
identical, r may more simply be regarded as the correlation
coefficient for a table formed by taking all possible pairs of the
n values in every sample. If this correlation be positive, the
standard error of the mean will be increased, and for a given
value of r the increase will be the greater, the greater the size of
the samples. If r be negative, on the other hand, the standard
error will be diminished. Equation (11) corresponds precisely to
equation (6), § 13, of Chap. XIV.
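
The effect of correlation between the cards of a sample can be checked by summing the covariances directly. The sketch assumes σ_m² = (σ²/n)[1 + r(n − 1)] and, for simplicity, takes every pair of cards to have the same correlation r; the function names are illustrative.

```python
def se_mean_squared_correlated(sigma, n, r):
    """Squared standard error of the mean when the n values of a sample
    have mean pairwise correlation r."""
    return sigma ** 2 / n * (1.0 + r * (n - 1))

def se_mean_squared_from_covariances(sigma, n, r):
    """The same quantity found by summing the whole n x n table of
    covariances (sigma**2 on the diagonal, r*sigma**2 elsewhere) and
    dividing by n**2."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += sigma ** 2 if i == j else r * sigma ** 2
    return total / n ** 2

independent = se_mean_squared_correlated(2.0, 10, 0.0)
positive = se_mean_squared_correlated(2.0, 10, 0.3)
negative = se_mean_squared_correlated(2.0, 10, -0.1)
print(independent, positive, negative)
```

With r = −1/(n − 1), the lowest value that an equal correlation between all pairs can take, the standard error of the mean vanishes.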

As was pointed out in that chapter, the case when r is positive
covers the case discussed under (a): for if we draw successive
samples from different records, such a positive correlation is at
once introduced, although the drawings of the several cards at
each sampling are quite independent of one another. Similarly,
the case discussed under (b) is covered by the case of negative
correlation, for if each card is always drawn from a separate and
distinct part of the record, the correlation between any two x's will
on the average be negative: if some one card be always drawn
from a part of the record containing low values of the variable,
the others must on an average be drawn from parts containing
relatively high values. It is as well, however, to keep the cases
(a), (b), and (c) distinct, since a positive or negative correlation
may arise for reasons quite different from those considered under
(a) and (b).

15. With this discussion of the standard error of the arithmetic
mean we must bring the present work to a close. To indicate
briefly our reasons for not proceeding further with the discussion
of standard errors, we must remind the student that in order to
express the standard error of the mean we require to know, in
addition to the mean itself, the standard-deviation about the mean,
or, in other words, the mean (deviation)² with respect to the mean.



Similarly, to express the standard error of the standard-deviation
we require to know, in the general case, the mean (deviation)⁴
with respect to the mean. Either, then, we must find this quantity
for the given distribution, and this would entail entering on a
field of work which hitherto we have intentionally avoided, or we
must, if that be possible, assume the distribution to be of such a
form that we can express the mean (deviation)⁴ in terms of the
mean (deviation)². This can be done, as a fact, for the normal
distribution, but the proof would again take us rather beyond
the limits that we have set ourselves. To deal with the standard
error of the correlation coefficient would take us still further
afield, and the proof would be laborious and difficult, if not
impossible, without the use of the differential and integral
calculus. We must content ourselves, therefore, with a simple
statement of the standard errors of the three most important
constants, standard-deviation, correlation coefficient, and
regression. [The fundamental memoirs are refs. 15, 17, 21.]

Standard-deviation. If the distribution be normal,

$$\text{standard error of the standard-deviation in a normal distribution} = \frac{\sigma}{\sqrt{2n}} \qquad (12)$$

This is generally given as the standard error in all cases: it is,
however, by no means exact: the general expression is

$$\text{standard error of the standard-deviation in a distribution of any form} = \sqrt{\frac{\mu_4 - \mu_2^2}{4\mu_2 n}} \qquad (13)$$

where μ₄ is the mean (deviation)⁴, deviations being, of course,
measured from the mean, and μ₂ the mean (deviation)² or the
square of the standard-deviation: n is assumed sufficiently large
to make the errors in the standard-deviation small compared with
that quantity itself. Equation (13) may in some cases give
values considerably greater, twice as great or more, than (12).
(Cf. ref. 14.) If, however, the distribution be normal, equation
(12) gives the standard error not merely of standard-deviations of
deviations of any order (ref. 25). It will be noticed, on reference 
to equation (4) above, § S, that the standard error of the standard- 
deviation is absolutely greater but relatively less than that of the 
semi- interquartile range for a normal distribution. 
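
The relation between the normal and the general expression may be seen from a short computation. The sketch assumes the forms σ/√(2n) for the normal case and √{(μ₄ − μ₂²)/(4μ₂n)} for the general case; the function names, and the helper for finding the moments of a set of observations, are illustrative only.

```python
from math import sqrt

def se_sd_normal(sigma, n):
    """Standard error of the standard-deviation, normal distribution."""
    return sigma / sqrt(2.0 * n)

def se_sd_general(mu2, mu4, n):
    """Standard error of the standard-deviation, distribution of any form,
    from the second and fourth moments about the mean."""
    return sqrt((mu4 - mu2 ** 2) / (4.0 * mu2 * n))

def central_moments(data):
    """Mean (deviation)**2 and mean (deviation)**4 of a list of values."""
    n = len(data)
    mean = sum(data) / n
    mu2 = sum((x - mean) ** 2 for x in data) / n
    mu4 = sum((x - mean) ** 4 for x in data) / n
    return mu2, mu4

# For a normal distribution mu4 = 3 * mu2**2, and the two expressions agree.
normal_case = se_sd_general(4.0, 3.0 * 4.0 ** 2, 100)
# A long-tailed form with mu4 = 9 * mu2**2 doubles the standard error.
long_tailed_case = se_sd_general(4.0, 9.0 * 4.0 ** 2, 100)
print(normal_case, long_tailed_case, se_sd_normal(2.0, 100))
```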
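For a normal distribution, again, we have

$$\text{standard error of the coefficient of variation } v \text{ in a normal distribution} = \frac{v}{\sqrt{2n}}\left\{1 + 2\left(\frac{v}{100}\right)^2\right\}^{1/2} \qquad (14)$$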






The expression in the bracket is usually very nearly unity, for
a normal distribution, and in that case may be neglected.

Correlation coefficient. If the distribution be normal,

$$\text{standard error of the correlation coefficient for a normal distribution} = \frac{1 - r^2}{\sqrt{n}} \qquad (15)$$

This is the value always given: the use of a more general formula
which would entail the use of higher moments does not appear
to have been attempted. As regards the case of small samples,
cf. ref. 23. Equation (15) gives the standard error of a coefficient
of any order, total or partial (ref. 25).

Coefficient of regression. If the distribution be normal,

$$\text{standard error of the coefficient of regression } b_{12} \text{ for a normal distribution} = \frac{\sigma_1\sqrt{1 - r_{12}^2}}{\sigma_2\sqrt{n}} = \frac{\sigma_{1.2}}{\sigma_2\sqrt{n}} \qquad (16)$$

This formula again applies to a regression coefficient of any order,
total or partial: i.e. in terms of our general notation, k denoting
any collection of secondary subscripts,

$$\text{standard error of } b_{12.k} = \frac{\sigma_{1.2k}}{\sigma_{2.k}\sqrt{n}}.$$



To convert any standard error to the probable error multiply by
the constant 0.674489 ....
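
These statements can be put into computable form. The sketch assumes (1 − r²)/√n for the standard error of the correlation coefficient and σ₁√(1 − r₁₂²)/(σ₂√n) for that of the regression coefficient b₁₂, both for a normal distribution; the probable-error constant is obtained as the upper quartile of the unit normal curve. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Deviation cutting off one quarter of the unit normal curve: 0.674489...
PROBABLE_ERROR_FACTOR = NormalDist().inv_cdf(0.75)

def se_correlation(r, n):
    """Standard error of a correlation coefficient, normal distribution."""
    return (1.0 - r ** 2) / sqrt(n)

def se_regression(sigma1, sigma2, r12, n):
    """Standard error of the regression coefficient b12 (x1 on x2)."""
    return sigma1 * sqrt(1.0 - r12 ** 2) / (sigma2 * sqrt(n))

def probable_error(standard_error):
    """Convert a standard error to the corresponding probable error."""
    return PROBABLE_ERROR_FACTOR * standard_error

# Hypothetical case: r = 0.6 from 400 observations.
print(se_correlation(0.6, 400), probable_error(se_correlation(0.6, 400)))
```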

16. We need hardly restate once more the warnings given in
Chap. XIV., and repeated in § 9 above, that a standard error can
give no evidence as to the biassed or representative character of
a sample, nor as to the magnitude of errors of observation, but
we may, in conclusion, again emphasise the warnings given
in §§ 1-3, Chap. XIV., as to the use of standard errors when
the number of observations in the sample is small.

In the first place, if the sample be small, we cannot in general 
assume that the distribution of errors is approximately normal : 
it would only be normal in the case of the median (for which p 
and q are equal) and in the case of the mean of a normal distri- 
bution. Consequently, if n be small, the rule that a range of 
three times the standard error includes the majority of the 
fluctuations of simple sampling of either sign does not strictly 
apply, and the " probable error " becomes of doubtful significance. 

Secondly, it will be noted that the values of σ and $y_p$ in (1), of
$f_p$ in (2), and of σ in (4) and (5), i.e. the values that would be
given for these constants by an indefinitely large sample drawn
under the same conditions, or the values that they possess in
the original record if the sample is unbiassed, are assumed to be
known a priori. But this is only the case in dealing with the
problems of artificial chance: in practical cases we have to use
the values given us by the sample itself. If this sample is based
on a considerable number of observations the procedure is safe
enough, but if it be only a small sample we may possibly
misestimate the standard error to a serious extent. Following the
procedure suggested in Chap. XIV., some rough idea as to the
possible extent of under-estimation or over-estimation may be
obtained, e.g. in the case of the mean, by first working out the
standard error of σ on the assumption that the values for the
necessary moments are correct, and then replacing σ in the
expression for the standard error of the mean by σ ± three times
its standard error so obtained.
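The procedure just described may be sketched as follows in Python. The sketch assumes a normal distribution, so that the standard error of σ is taken as σ/√(2n); the function name and the sample sizes are hypothetical, and the value 21.3 is borrowed from the exercises merely for illustration.

```python
from math import sqrt


def mean_se_range(sigma: float, n: int) -> tuple[float, float, float]:
    """Return (low, central, high) values for the standard error of the mean."""
    # Standard error of sigma itself, assuming a normal distribution.
    se_sigma = sigma / sqrt(2 * n)
    # Replace sigma by sigma -/+ three times its own standard error.
    low_sigma = max(sigma - 3 * se_sigma, 0.0)
    high_sigma = sigma + 3 * se_sigma
    return low_sigma / sqrt(n), sigma / sqrt(n), high_sigma / sqrt(n)


# Illustrative sample sizes only: the spread narrows as n grows.
for n in (25, 100, 10000):
    low, mid, high = mean_se_range(21.3, n)
    print(n, round(low, 3), round(mid, 3), round(high, 3))
```

For a small sample the low and high values differ widely, which is the possible misestimation the text warns of; for a large sample they nearly coincide.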

Finally, it will be remembered that unless the number of 
observations is large, we cannot interpret the standard error of 
any constant in the inverse sense, i.e. the standard error ceases 
to measure with reasonable accuracy the standard-deviation of 
true values of the constant round the observed value (Chap. 
XIV. § 3). If the sample be large, the direct and inverse 
standard errors are approximately the same. 



REFERENCES. 



191. 

(2) Bowley, A. L., The Measurement of Groups and Series; C. & E. Layton,
London, 1903.

(3) Bowley, A. L., Address to Section F of the British Association, 1906.

(4) Edgeworth, F. Y., "Observations and Statistics: An Essay on the
Theory of Errors of Observation and the First Principles of Statistics,"
Cambridge Phil. Trans., vol. xiv., 1885, p. 139.

(5) Edgeworth, F. Y., "Problems in Probabilities," Phil. Mag., 5th Series,
vol. xxii., 1886, p. 371.

(6) Edgeworth, F. Y., "The Choice of Means," Phil. Mag., 5th Series,
vol. ..., 1887, p. 288.

(7) Edgeworth, F. Y., "On ...," Jour. Roy. Stat. Soc., ...
Addendum, vol. lxxii., 1909, p. 81.

(8) Elderton, W. Palin, "Tables for Testing the Goodness of Fit of Theory
to Observation," Biometrika, vol. i., 1902, p. 155.

(9) Gibson, Winifred, "Tables for Facilitating the Computation of
Probable Errors," Biometrika, vol. iv., 1906, p. 385.

(10) Heron, D., "An Abac to determine the Probable Errors of Correlation
Coefficients," Biometrika, vol. vii., 1910, p. 411. (A diagram giving
the probable error for any number of observations up to 1000.)

(11) Heron, D., "On the Probable Error of a Partial Correlation Coefficient,"
Biometrika, vol. vii., 1910, p. 411. (A proof, on ordinary algebraic
lines, for the case of three variables, of the result given in (25).)

(12) Laplace, Pierre Simon, Marquis de, Théorie des probabilités, 2me éd.,
1814. (With four supplements.)

(13) Pearl, Raymond, "The Calculation of the Probable Errors of Certain
Constants of the Normal Curve," Biometrika, vol. v., 1906, p. 190.

(14) Pearl, Raymond, "On certain Points concerning the Probable Error
of the Standard-deviation," Biometrika, vol. vi., 1908, p. 112. (On
the amount of divergence, in certain cases, from the probable error
σ/√(2n) in the case of a normal distribution.)

(15) Pearson, Karl, and L. N. G. Filon, "On the Probable Errors of
Frequency Constants, and on the Influence of Random Selection on
Variation and Correlation," Phil. Trans. Roy. Soc., Series A, vol. cxci.,
1898, p. 229.

(16) Pearson, Karl, "On the Criterion that a given System of Deviations
from the Probable in the Case of a Correlated System of Variables is
such that it can be reasonably supposed to have arisen from Random
Sampling," Phil. Mag., 5th series, vol. l., 1900, p. 157.

(17) Pearson, Karl, and others (editorial), "On the Probable Errors of
Frequency Constants," Biometrika, vol. ii., 1903, p. 273. (Useful for
the general formulæ given, based on the general case without respect to
the form of the frequency-distribution.)

(18) Pearson, Karl, "On the Curves which are most suitable for describing
the Frequency of Random Samples of a Population," Biometrika, vol.

181.

(20) Rhind, A., "Tables for Facilitating the Computation of Probable Errors
of the Chief Constants of Skew Frequency-distributions," Biometrika,
vol. vii., 1909-10, p. 127 and p. 386.

(21) Sheppard, W. F., "On the Application of the Theory of Error to Cases
of Normal Distribution and Normal Correlation," Phil. Trans. Roy.
Soc., Series A, vol. cxcii., 1898, p. 101.

(22) "Student," "On the Probable Error of a Mean," Biometrika, vol. vi.,
1908, p. 1. (The standard error of the mean in terms of the standard
error of the sample.)

(23) "Student," "On the Probable Error of a Correlation Coefficient,"
Biometrika, vol. vi., 1908, p. 302. (The problem of the probable error
with small samples.)

(24) "Student," "On the Distribution of Means of Samples which are not
drawn at Random," Biometrika, vol. vii., 1909, p. 210.

(25) Yule, G. U., "On the Theory of Normal Correlation for any number of
Variables treated by a New System of Notation," Proc. Roy. Soc.,
Series A, vol. lxxix., 1907, p. 182. (See pp. 192-3 at end.)

Reference may also be made to the following, which deals for the
most part with the effects of errors other than errors of sampling:

(26) Bowley, A. L., "Relations between the Accuracy of an Average and
that of its Constituent Parts," Jour. Roy. Stat. Soc., vol. lx., 1897,






EXERCISES. 

1. For the data in the last column of Table IX., Chap. VI. p. 96, find
the standard error of the median (154.7 lbs.).

2. For the same distribution, find the standard errors of the two quartiles
(142.5 lbs., 168.4 lbs.).

3. For the same distribution, find the standard error of the semi-interquartile
range.

4. The standard-deviation of the same distribution is 21.3 lbs. Find the
standard error of the mean, and compare its magnitude with that of the
standard error of the median (Qn. 1).

5. Work out the standard error of the standard-deviation for the distribution
of statures used as an illustration in § 6. (Standard-deviation 2.57 in.:
8585 observations.) Compare the ratio of standard error of standard-deviation
to the standard-deviation, with the ratio of the standard error of
the semi-interquartile range to the semi-interquartile range, assuming the
distribution normal.

6. Calculate a small table giving the standard errors of the correlation
coefficient, based on (1) 100, (2) 1000 observations, for values of r = 0, 0.2, 0.4,
0.6, 0.8, assuming the distribution normal.
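A table of the kind asked for in the last exercise may be drawn up with a few lines of Python. The sketch assumes the usual normal-theory expression (1 - r²)/√n for the standard error of the correlation coefficient, which is not restated in this section; the function name is illustrative only.

```python
from math import sqrt


def se_correlation(r: float, n: int) -> float:
    """Standard error of r for a normal distribution: (1 - r^2) / sqrt(n)."""
    return (1 - r * r) / sqrt(n)


# One row per value of r, with columns for n = 100 and n = 1000.
for r in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(r, round(se_correlation(r, 100), 3), round(se_correlation(r, 1000), 4))
```

The standard error falls as r approaches unity, and is reduced in the ratio 1 : √10 on passing from 100 to 1000 observations.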






APPENDIX I.

TABLES FOR FACILITATING STATISTICAL WORK.

A. CALCULATING TABLES.

For heavy arithmetical work an arithmometer is, of course,
invaluable; but, owing to their cost, arithmetic machines are, as a
rule, beyond the reach of the student. For a great deal of simple
work, especially work not intended for publication, the student
will find a slide-rule exceedingly useful: particulars and prices
will be found in any instrument maker's catalogue. A plain
25-cm. rule will serve for most ordinary purposes, or if greater
accuracy is desired, a 50-cm. rule, a Fuller spiral rule, or one of
the Hannyngton-pattern rules (Aston & Mander, London), in which
the scale is broken up into a number of parallel segments, may be
preferred. For greater exactness in multiplying or dividing,
logarithms are almost essential: five-figure tables suffice if answers
are only desired true to five digits; if greater accuracy is needed,
seven-figure tables must be used. It is hardly necessary to cite
special editions of tables of logarithms here, but attention may
perhaps be directed to the recently issued eight-figure tables of
Bauschinger and Peters (W. Engelmann, Leipzig, and Asher & Co.,
London, 1910; vol. i. containing logarithms of all numbers from
1 to 200,000, price 18s. 6d. net.; vol. ii. to contain logs. of
trigonometric functions).

If it is desired to avoid logarithms, extended multiplication
tables are very useful. There are many of these, and four of
different forms are cited below. Zimmermann's tables are
inexpensive and recommended for the elementary student, Cotsworth's,
Crelle's, or Peters' tables for more advanced work. Barlow's tables
are invaluable for calculating standard-deviations of ungrouped
observations and similar work.

(1) Barlow's Tables of Squares, Cubes, Square-roots, Cube-roots, and
Reciprocals of all Integer Numbers up to 10,000; E. & F. N. Spon,
London and New York; stereotype edition, price 6s.

(2) Cotsworth, M. B., The Direct Calculator, Series O. (Product table to
1000 x 1000.) M'Corquodale & Co., London; price with thumb index,
25s.; without index, 21s.

(3) Crelle, A. L., Rechentafeln. (Multiplication table giving all products up
to 1000 x 1000.) Can be obtained with explanatory introduction in
German or in English. G. Reimer, Berlin; price 15s.

(4) Elderton, W. P., "Tables of Powers of Natural Numbers, and of the
Sums of Powers of the Natural Numbers from 1 to 100" (gives
powers up to seventh), Biometrika, vol. ii., p. 474.

(5) Peters, J., Neue Rechentafeln für Multiplikation und Division. (Gives
products up to 100 x 10,000: more convenient than Crelle for forming
four-figure products. Introduction in English, French or German.)
G. Reimer, Berlin; price 15s.

(6) Zimmermann, H., Rechentafel, nebst Sammlung häufig gebrauchter
Zahlenwerthe. (Products of all numbers up to 100 x 1000: subsidiary
tables of squares, cubes, square-roots, cube-roots and reciprocals, etc.
for all numbers up to 1000 at the foot of the page.) W. Ernst & Son,
Berlin ; price Sa. ; English edition, Asher & Co. , London, Ss. 

B. SPECIAL TABLES OF FUNCTIONS, ETC.

Several tables of service will be found in the works cited in
Appendix II., e.g., a table of Gamma Functions in Elderton's
book (10) and a table of six-figure logarithms of the factorials
of all numbers from 1 to 1100 in De Morgan's treatise (9). The
tables cited below from Biometrika are to be included with others
in a volume entitled Tables for the Use of Statisticians and
Biometricians, now in the press, to be issued for the Biometric
Laboratory of University College, London, by Messrs Dulau & Co.

(7) Davenport, C. B., Statistical Methods, with special reference to
Biological Variation; New York, John Wiley; London, Chapman &
Hall; second edition, 1904. (Tables of area and ordinates of the
normal curve, gamma functions, probable errors of the coefficient of
correlation, powers, logarithms, etc.)

(8) Elderton, W. P., "Tables for Testing the Goodness of Fit of Theory to
Observation," Biometrika, vol. i., 1902, p. 155.

(9) Everitt, P. F., "Tables of the Tetrachoric Functions for Fourfold
Correlation Tables," Biometrika, vol. vii., 1910, p. 437. (Tables for
facilitating the calculation of the correlation coefficient of a fourfold
table by Pearson's method on the assumption that it is a grouping
of a normally distributed table; cf. ref. 14 of Chap. XVI.)

(10) Gibson, Winifred, "Tables for Facilitating the Computation of
Probable Errors," Biometrika, vol. iv., 1906, p. 385.

(11) Heron, D., "An Abac to determine the Probable Errors of Correlation
Coefficients," Biometrika, vol. vii., 1910, p. 411. (A diagram giving
the probable error for any number of observations up to 1000.)

(12) Lee, Alice, "Tables of F(r, ν) and H(r, ν) Functions," British
Association Report, 1899. (Functions occurring in connection with Professor
Pearson's frequency curves.)

(13) Rhind, A., "Tables for Facilitating the Computation of Probable Errors
of the Chief Constants of Skew Frequency-distributions," Biometrika,
vol. vii., 1909-10, p. 127 and p. 386.

(14) Sheppard, W. F., "New Tables of the Probability Integral," Biometrika,
vol. ii., 1903, p. 174. (Includes not merely table of areas of the normal
curve (to seven figures), but also a table of the ordinates to the same
degree of accuracy.)

(15) Sheppard, W. F., "Table of Deviates of the Normal Curve" (with
introductory article on Grades and Deviates by Sir Francis Galton),
Biometrika, vol. v., 1907, p. 404. (A table giving the deviation of
the normal curve, in terms of the standard-deviation as unit, for the
ordinates which divide the area into a thousand equal parts.)






APPENDIX II.

SHORT LIST OF WORKS ON THE MATHEMATICAL
THEORY OF STATISTICS AND THE THEORY OF
PROBABILITY.

The student may find the following short list of service, as
supplementing the lists of references given at the ends of the
several chapters, the latter containing, as a rule, original memoirs
only. The economic student who wishes to know more of the
practical side of statistics may be referred to Mr A. L. Bowley's
"Elements" (5 below), to An Elementary Manual of Statistics
(Macdonald & Evans, London, 1910), by the same writer (useful
as a general guide to English statistics), and to M. Jacques
Bertillon's Cours élémentaire de statistique (Société d'éditions
scientifiques, 1895: international in scope). Dr A. Newsholme's
Vital Statistics (Swan Sonnenschein, 3rd edn., 1899) will also be
of service to students of that subject.

All the works mentioned in the following list, with others which
it has not been thought necessary to include, are in the library
of the Royal Statistical Society.

(1) Airy, Sir G. B., On the Algebraical and Numerical Theory of Errors of
Observations; 1st edn., 1861; 3rd edn., 1879.

(2) Bernoulli, J., Ars conjectandi, opus posthumum: Accedit tractatus de
seriebus infinitis, et epistola gallice scripta de ludo pilæ reticularis,
1713. (A German translation in Ostwald's Klassiker der exakten
Wissenschaften, Nos. 107, 108.)

(3) Bertrand, J. L. F., Calcul des probabilités; Gauthier-Villars, Paris, 1889.

(4) Borel, E., Éléments de la théorie des probabilités; Hermann, Paris,
1909.

(5) Bowley, A. L., Elements of Statistics; P. S. King, London; 1st edn.,
1901; 3rd edn., 1907.

(6) Bruns, H., Wahrscheinlichkeitsrechnung und Kollektivmasslehre;
Teubner, Leipzig, 1906.

(7) Cournot, A. A., Exposition de la théorie des chances et des probabilités,
1843.

(8) Czuber, E., Wahrscheinlichkeitsrechnung und ihre Anwendung auf
Fehlerausgleichung, Statistik und Lebensversicherung; Teubner,
Leipzig, 2nd edn., vol. i., 1908-10.

(9) De Morgan, A., Treatise on the Theory of Probabilities (extracted from
the Encyclopædia Metropolitana), 1837.

(10) Elderton, W. P., Frequency Curves and Correlation; C. & E. Layton,
London, 1906. (Deals with Professor Pearson's frequency curves and
correlation, with illustrations chiefly of actuarial interest.)

(11) Fechner, G. T., Kollektivmasslehre (posthumously published; edited
by G. F. Lipps); Engelmann, Leipzig, 1897.

(12) Galloway, T., Treatise on Probability (republished from the 7th edn.
of the Encyclopædia Britannica), 1839.

(13) Gauss, C. F., Méthode des moindres carrés: Mémoires sur la combinaison
des observations, traduits par J. Bertrand, 1855.

(14) Laplace, Pierre Simon, Marquis de, Essai philosophique sur les
probabilités, 1814. (The introduction to 15, separately printed with
some modifications.)

(15) Laplace, Pierre Simon, Marquis de, Théorie analytique des probabilités;
2nd edn., 1814, with supplements 1 to 4.

(16) Lexis, W., Abhandlungen zur Theorie der Bevölkerungs- und
Moralstatistik; Fischer, Jena, 1903.

(17) Poincaré, H., Calcul des probabilités; Gauthier-Villars, Paris, 1896.

(18) Poisson, S. D., Recherches sur la probabilité des jugements en matière
criminelle et en matière civile, précédées des règles générales du calcul
des probabilités, 1837. (German translation by C. H. Schnuse, 1841.)

(19) Quetelet, L. A. J., Lettres sur la théorie des probabilités, appliquée aux
sciences morales et politiques, 1846. (English translation by O. G.
Downes, 1849.)

(20) Thorndike, E. L., An Introduction to the Theory of Mental and Social
Measurements; Science Press, New York, 1904.

(21) Venn, J., The Logic of Chance: an Essay on the Foundations and
Province of the Theory of Probability, with especial reference to its
Logical Bearings and its Application to Moral and Social Science and to
Statistics; 3rd edn., Macmillan, London, 1888.

(22) Westergaard, H., Die Grundzüge der Theorie der Statistik; Fischer,
Jena, 1890.






ANSWERS 

TO, AND HINTS ON THE SOLUTION OF, THE EXERCISES GIVEN. 



I. N 2fl,287 {AB) 887 

(A) 2,308 {AC) 874 

(B) 2.8B3 (SO 363 
(O 7*9 (,ABC) 146 

% (ABO 168 (ofiCO 179 

(ABy) 431 (^y) 1,2*9 

(-ifl6') 272 (nflO 188 

(A^) 766 (way) 20,604 

3. The frequencies not given in the question itself are:

(*) {AB) 107 {AC) 406 {BO 625. 

(6) (^fty) 22,880 {oBy) 13,686 (aSC) 96,478 («9t) 28,888,495. 

4 <^)>(£) . (-^g) , (g> 

(^«) (flj ■■ WB) + (^fl) (S)+W 

6. <.4£) + (£C)-<S), i.e., t}iesllmofthesiceBBeaof(vl^Bnd(£f7)over(S)/2, 
8. 180. Take ^ = huabond exoeeding wire in first measurement, B = 
husband exoeeding wife in second mesenrement, and find {aB). 

CHAPTER II. 

1. 80/263 or 304 per thousand.

2. 55/85 or 65 per cent.

3. 32 per oeut. and SO per cent. 

4. 117. 
6. 108. 

8. p>l (l-25),p<i{l + 2}),i,e.,j)mmt lis between and J (1 - 2j) or 
between ^ (1 +2$) and \. 
0. Ae a hint, ramember the condition that — 






CHAPTER III. 



1. Desf-muEeB from childliood per million among malea 222 ; among 
famales 18S ; there is therefore positive usociation between deaf-ma^Bm and 
male hx ; if there hod been do associatioa between deaf-matism and eez, there 
would hare been 3176 mole and 3SB3 female deof-muteB. 

2. (a)positiveassociation, since (^^)g — 1167. 

(b) negative association, since 294/490 = 3/5, 380/570 = 2/3.
(c) independence, since 256/768 = 1/3, 48/144 = 1/3.

3. Peroentage of Plants above the Average Height. 

Pareatiige Crossed. Self-fertilised. 

IpomBa piirpima . . S6 per cent. 2fi per cent. 

Fettmia violaoea . . . 7S „ 17 „ 

Reseda Intea ... 78 „ 34 ,, 

Reseda odorata . . 71 , , 46 , , 

Lobelia folgene . . . 50 ,, 36 ,, 

The association is much less for the species at the end than for those at the
beginning of the list.

4. Petceotage of dark-eyed amongst the sons of dark.eyed fathers 39 per 

Percentage of dark -eyed amongst the hodb of not dark-«;ed others 10 per 

If there had been no heredity, the frequencies to the nearest unit would 
havebeeuM^lS, Ud),111,(a£}ol21, (a3jo760. 

E. Percentage of light-eyed amongst the wives of light-ejed husbands 69 
per cent. 

Percentage of ligbt-eyed amongst the wives of not light-eyed husbands 63 
percent. 

Itthere had been no association: (^5),=298, (^fl),=226,(afi)(,=H3,(«B)„ 
= 108. 

6. The following are the proportions of the insane per thousand in
successive age groups:

In general population: O'B, 28, 41, B7, 89, 7'6, 7'7, 8*8. 
Amongst the blind: 20-1, 19-0, lfl-8, 20-7, 18-3, 17-8, 11-4, 6-8. 

Note the diminishing association, which is especially clear in the age-group
65 and over, and the negative association in the last age-group. The association
coefficient gives the values below, which decrease continuously:

Association coefficient: -f-0-92, -^0■76, ■^0■fll, -t-0'67, -HO'ie, +0-41, 



CHAPTER IV. 



{fiD)m = 3-6 
{A0D)HA0) ^il-2 

i,BD)/(B) =42-7 
(ABD)KAB)^61-e 



(,A)IN = a-8per« 

{AD)HD) =44-6 

{A»)I{B.) = 1-7 „ 

Mfli>)/WZ))=E4-B „ 

lAB)l(B) =20-8 „ 

(ABD)l{BD)=Zl-S 



The above give two legitimate comparisons. The general results are the same
as for the boys, i.e. a very small association between development-defects and
dulness amongst those exhibiting nerve-signs, as compared with those who do
not exhibit nerve-signs, or with the girls in general. As the association
amongst those who do not exhibit nerve-signs is quite as high as for the girls
in general, the "conclusion" quoted does not seem valid.



thousand, thousand. 
(B)IN 3-2 7-6 

Ub)I{A) 14-9 117 



The above give the two simplest comparisons, either of which is sufficient to
show that there is a high association between blindness and mental derangement
amongst the deaf-mutes as well as in the general population; amongst
the old, the association is, in fact, small for the general population, but well
marked for deaf-mutes. This result stands in direct contrast with that of
Qu. 1, where the association between the two defects A and D was much
smaller in the defective universe B than in the universe at large. As previously
stated, no great reliance can be placed on the census data as to these infirmities.
S. Ifthecancer death-rates for farmers over 16 and under 15 respectlvelj 
irers the same a» for the population at large, the rate for all fanners IS — 
would bel-ll. This is j%U;^ leas than the actual rate I'SO, bat the excess 
notltd not justi^ the statement that ' ' fanners were peculiarly liable to oancer. " 
It is, in point of fact, due to the farther differencea of age-distributiou tiiat we 
have neglected, e.g. amongst those over 4G there are more over 5B amongst 
farmers than amongst the general population, and bo on. 

4. IG percent. 

5. If ^ and B were independent in both C and y universes, we would have 
{A B) equal to 

471x419 151xlB9 _ 

"-Si7~"^~383 "*^- 

Actaall7(.rf£)only = 85S. Therefore <f and £ must be disassociated in one or 
both pui^al universes. 

9. (1) 6S'l per cent. (2) 42 G per cent The fallacy discussed in g 2 ia 
now avoided, and there seems no reason for declining to consider this as evidence 
-of the effect of expenditure on election reaulta. 

10. The limits to;/ are— 



aubjeet to the conditiona jij»«, ^0, y-ii^ - 1. No inference of a positive 
associatian from two negatiTes is possible unless x lies between the limits 
■882 . . . , -618 .... 
1 1. The limits to y are : — 

<1) y<V.^x-^--\.) 

BubjecttooonditionB y^O, -J^ix-l, 3>«. 

An inference is only possible from positive association s of ^5 and ACHx^ 
) ; an inference is only poaaible from two negative associations if x lies between 



aabjeot lo conditionB y^^O, ^;5a:- 1, J»3V ,-, , 




No iaference is posaible from poaitii 
&D inferance h onlj poauble from neg 
■183 .... and ■21B .... Note that 



•otgeet to the conditions y-^O, ^fBa;-!, ^0, 

As in (2), no inference is possible from [roaitive associations of ^CanU BG; 
■aiDfereDceis possible from negativeossociatianB ifxliebetween -177 .... 
and '221 .... Note that x cannot exceed |. 



CHAPTER V. 
1. J, 0-68. B, 0-86. 

CHAPTER VI.

1. 1200; 200. 2. 100 > 20. 3. 14S'26. 4. 216-6. 

CHAPTER VII. 

2. Mean, 156.73 lb. Median, 154.67 lb. Mode (approx.) 150.6 lb. (Note
that the mean and the median should be taken to a place of decimals further
than is desired for the mode: the true mode, found by fitting a theoretical
frequency curve, is 151.1 lb.)

3. Mean, 0.6330. Median, 0.6391. Mode (approx.), 0.651. (True mode
is 0.653.)

4. £35.6 approximately.

5. (1) lie-0. (2) Means 77-4, 8B'0, ratio 114-9. (3) Geometrical means 77*2, 
88-B, ratio 115-2. (1)115-2. 

6. (1)921,607. (2)916,983. 

7. Ist qual. lOe. 6id. 2nd qual. 9b. 2^. 

8. np. If the terms of the given binomial series are multiplied by 0, 1, 2, 3
. . . , note that the resulting series is also a binomial when a common factor
is removed. [The full proof is given in Chapter XV. § 8.]

CHAPTER VIII. , 

2. Standard deviation 21.3 lb. Mean deviation 16.4 lb. Lower quartile
142.5, upper quartile 168.4; whence Q = 12.95. Ratios: m.d./s.d. = 0.77,
Q/s.d. = 0.61. Skewness, 0.29.

3. Approximately lower quartile = £26.1, upper quartile = £61.8, ninth
decile = £94.

5. (1) if=78-2, (r=17-3. (2) ir=73-2, ff = lT6. (3) Jf=78-2, ^ = 18-0. 
(Note that while the mean is unaffected in the second place of decimals, the
standard deviation is the higher the coarser the grouping.)

6. √(npq). The proof is given in Chapter XV. § 6.

7. The assumption that observations are evenly distributed over the
intervals does not affect the sum of deviations, e
the mean or median lies: for that interval the si
entire correction is

In this expression d is, of course, expressed as a fraction of the class-interval,
and is given its proper sign. Notice that the n₁ and n₂ of this question are
not the same as the N₁ and N₂ of § 16.



CHAPTER IX. 

1. σ_x = 1.414, σ_y = 2.280, r = +0.81. X = 0.5Y + 0.5. Y = 1.3X + 1.1.

3. Using the subscripts 1 for earnings, 2 for pauperism, 3 for out-relief ratio.



CHAPTER XI. 

1. 1-232 per cent, (against 1 '240 percent.): 2-BBS in. against 2'672in. 

2. The corrected standard-deTiatioa ia 0'9S54 of the rough value. 

3. Eatiinated trae standardnieviation 6 '91 : atandard-i^yiation of fluctua- 
tions of sampling 9'SS. (The latter, which can be independentlj calculated, 
is too low, and the former consequently probablv too high, (y. Chap. XIV. 

BIO.) 

4. 0-43. 



6. lTl*/^/^^?^^o?)(^+»?)■ 




<n down from symmetry. 
I. (1) No effect at lU. (2) If the mean value of the an 
d, and in the weights e, the value found for the weighted m 



If r is small, d is the important term, and hence errors in the quantities are
usually of more importance than errors in the weights. If r become
considerable, errors in the weights may be of consequence, but it does not seem
probable that the second term would become the most important in practical
cases.


CHAPTER XII. 



■0769,7'„,a=-H0-097, r^i-- 
'6i, (ri.u = 0-6S4, o's.is=70-] 
■31 -h3'87 Jrs-0-008fl4 X,. 




2- '■i«-»i=+f"680,ru.u= +0-803, r,4.B=+0'397. 
^»u= -0'«8, r^i,= - 0-658, r„i3= - 0'149. 
<r,.a,=9-11, »^,„=iB-2, iTm, = 12-5, o-j.,^ = 106'*. 
X, = 58 + 0-127 £3 + Q■6^^ Xa + 0'0346 Z^. 
3. The correlation of the pth order is r/(1 + pr). Hence if r be negative, the
correlation of order n - 2 cannot be numerically greater than unity and r
cannot exceed (numerically) 1/(n - 1).
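The result may be checked numerically. The following Python sketch assumes the standard expression for a first-order partial correlation in terms of the three total correlations, and applies it repeatedly to the case in which every total correlation has the same value r; the function names are illustrative only.

```python
from math import sqrt, isclose


def partial(r12: float, r13: float, r23: float) -> float:
    """First-order partial correlation r12.3 from three total correlations."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


def equal_r_partial(r: float, p: int) -> float:
    """Partial correlation of order p when every total correlation equals r."""
    current = r
    for _ in range(p):
        # All coefficients of a given order are equal by symmetry, so one
        # step of the recursion uses the same value three times.
        current = partial(current, current, current)
    return current


r = -0.2
for p in range(4):
    assert isclose(equal_r_partial(r, p), r / (1 + p * r))
```

With r = -0.2 the coefficient of order 3 is -0.5, and the limit -1 would be reached at order 4, i.e. with six variables, for which 1/(n - 1) = 0.2.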

i. -Tjf 

.1=+!. 



CHAPTER XIII. 

1. Theo. M = 6, σ = 1.732: Actual M = 6.116, σ = 1.732.

2. (a) Theo. M = 2.5, σ = 1.118: Actual M = 2.48, σ = 1.14.
(b) Theo. M = 3, σ = 1.225: Actual M = 2.97, σ = 1.26.
(c) Theo. M = 3.5, σ = 1.323: Actual M = 3.47, σ = 1.40.

3. Theo. M = 50, σ = 5: Actual M = 50.11, σ = 5.23.
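The theoretical figures in these answers agree with the mean np and standard-deviation √(npq) of a binomial series with p = q = 1/2. In the Python sketch below the values of n (12, 5, 6, 7 and 100) are inferred from the printed answers and are therefore an assumption, not a quotation of the exercises themselves.

```python
from math import sqrt


def binomial_mean_sd(n: int, p: float) -> tuple[float, float]:
    """Mean n*p and standard deviation sqrt(n*p*q) of a binomial series."""
    q = 1 - p
    return n * p, sqrt(n * p * q)


# Values of n inferred from the printed answers (an assumption).
for n in (12, 5, 6, 7, 100):
    mean, sd = binomial_mean_sd(n, 0.5)
    print(n, mean, round(sd, 3))
```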

4. ThB atandard deviation of the proportion is O'OOITS, and the actual 
diTergeuce is Ci'4 timeB this, aad therefore almost certsiulj significant. 

5. The standard deviation of the number drawn is 32, and the actual
difference from expectation 16. There is no significance.

6. p = 1 − σ²/M, n = M/p: p = 0.510, n = 12.0: p = 0.454, n = 110.4.
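The two formulæ of this answer follow from M = np and σ² = npq, and may be applied as in the Python sketch below. The function name is illustrative, and the inputs are the actual means and standard-deviations quoted in answers 1 and 3 above.

```python
def fit_binomial(mean: float, sd: float) -> tuple[float, float]:
    """Return (p, n) of the binomial with the given mean and standard deviation."""
    p = 1 - sd ** 2 / mean   # sd^2 / mean = npq / np = q, and p = 1 - q
    n = mean / p             # mean = np
    return p, n


print(fit_binomial(6.116, 1.732))
print(fit_binomial(50.11, 5.23))
```

Since n so found is in general fractional, it serves as a check on the observed series, not as an exact count of events.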

8. Standard deviation of simple sampling 23.0 per cent. The actual
standard-deviation does not, therefore, seem to indicate any real variation, but
only fluctuations of sampling.

9. Difference from expectation 7'C ; standard error 10*0. The difTereQCB 
might therefore occur frequently as a fluctuatian of sampling. 

10. The test can be applied either by the formulæ of Case II. or Case III.
Case II. is taken as the simplest.

[a.) [AB)j{B)^69-\ per cent: (^^)/(^) = 80-0 per cent. Difference 10-9 

Sroent. (.^)/A'=71'l percent and thence «,j=12'B per cent The actual 
Serenoe is less than this, and would frequently occur u a fluctuation of 
simple sampling. 

(b) (AB)/(B) = 70.1 per cent.: (Aβ)/(β) = 64.3 per cent. Difference 5.8 per
cent. (A)/N = 67.6 per cent., and thence ε₁₂ = 3.40 per cent. The actual
difference is 1.7 times this, and might, rather infrequently, occur as a
fluctuation of simple sampling.

CHAPTER XIV. 

1. Row. Op. Qronp of Howt vp. 

1 8-1 B, S. and? 1-8 

2 2-1 8, S, 10, and 11 I'S 
S 1-7 12, IS, and 14 12 
4 2'7 IG and upwards 1*1 

Vf 18 given in units per 1000 births, as s and Sg. 

2. 30 = 7-02, and ffp=2-6 units. 

3. σ² = npq, as if the chance of success were p in all cases (but the mean is
n/2, not pn).

4. Ueau namber of deaths per annum = s'g'= 690, 

,s=668,682. r=0-(H 






CHAPTER XV. 



2. The freqaenoy of r successes is greater than that of r-1 so long » 



6. The data are Jf =68-866, ^=2-68, y^=lbi-S. 

6. (!) United Eiugdom— direct 1'7C, JiomataDdard-demtion 1'78. 
(2) Cambridge Btndents — direct 1*68, from standord-deviatioii 1'73. 

7. 70'6peieent 8. 27 percent 

9, (1) In a IS'4 peroent., i I'O percent, of the trials, asaaming nonniility, 
bnt the aasamption is hordlj guit« valid. [2) a abont IS times in 100,000 
trials ; b pnctioall;f impouible, being a dentation of orer 7 timee the stanoatd 

10. SG3. II. Mean 74-8, standard-deviation 3-23. 



CHAPTER XVI. 

S. From ec|i)atiODS(10)and(ll} replace a-y and vg b; 2i and S* iu equation 
(B). BegudiQ^ this as an equation for r, note that r^ is a niajimnm when 
(on 2 is infinite, or »=4B°. C i)(lok' 




4. In fig. 50, suppose every horizontal array to be given a slide to the right
until its mean lies on the vertical axis through the mean of the whole
distribution: then suppose the ellipses to be squeezed in the direction of this vertical
axis until they become circles. The original quadrant has now become a
sector with an angle between one and two right angles, and the question is
solved on determining its magnitude.



CHAPTER XVII. 

1. Estimated frequency 1512, standard error 0-2S ]b. 2. Lower Q, 
freqaencj 1472, standard error 026 lb. ; upper Q, frequency lllS, standard 
error 0-S4 lb. S. O'lS lb. i. 0-21 lb., 17 per cent, less than the staudud 

arror of the median. 6. 00 ■..-.■ 

the stani^kTd error of the si 
iMge. 



r        n = 100.    n = 1000.
0        0.1         0.0316
0.2      0.096       0.0304
0.4      0.084       0.0266
0.6      0.064       0.0202
0.8      0.036       0.0114





INDEX. 



[The references are to pages. The subject matter of the Exercises given at
the ends of the chapters has been indexed only when such exercises (or
the answers thereto) give the constants for statistical tables in the text,
or theoretical results of general interest; in all such cases the number of
the question cited is given. In the case of authors' names, citations in
the text are given first, followed by citations of the authors' papers or
books in the lists of references.]



AooiDBNT, deaths from (law of small i 

chances), 261-262. 
Aohenwall, Gottfried, AbrUs der ' 

SlaatsuAasensehaft, 1. 
Ages, at death of certain women i 

(table), 78 ; of husband and wife i 

(correlation), 169 ; diagram, 173 ; 

constants (qu. 3), 189. 
AggregatB, of classes, 10-11. 
A^cmtural laboums' earuings. See 

Earnings. 
Airy, Sir G. B., use of terms " error 

of mean square" and "modulus," 

1«. Kefs., Theory of Errors of 

Observat'Um, 366. 
Ammon, O., hair and eye-colour data ■ 

cited from, 61, 
Annual value of dwelling -bouses 

(toble), 83; of estates in 1715, 

table 100, diagram, 101. 
Arithmetic mean. See Mean, arith- 

Array, def., 164 ; standard-deviation 
of, 177, 20*, 232-233, in normal 
correlation, 815-316. 

Association, generally, 25-59; def., 
2S ; degrees of, 29-30 ; testing by 
comparisou of psrc«ntages, 30--3B ; 
constancy of differenoe from in- 
dependence values for the second- 
ordBT fteqaencies, 36-38 ; co- 
efficient of, 87-88; Uliisory or 



misleading, 48-51 ; total possible 
number o^ for n attributes, 61-66 ; 
case of complete independence, 
66-57 i use of ordinary ocrrelation- 
coefiicient as measure of assooiatioa, 
212-218 ; Pearson's coefficient based 
on normal correlation (refs.), 39, 
329 ; refs., 16, 39, 32S. 
Association, partial, generally, 42-59 ; 
the problem, 42-43 ; total and par- 
tial, def., 41 ; aritbraetical treat- 
ment, 44-48 ; testing, in ignoranoe 
of third-order frequencies, 61-64 ; 

examples; deaths aod sex, 32- 

33 ; deaths and occupation, 62-58 ; 
dcaf-matism and imbecility, 33-31 ; 
eye'Cotour of father and sou, 34-35 ; 
eye-colour of grandparent, parent, 
and offspring, 16-48, 53-54 ; colour 
and pricklinessof Doiit™ fruits, Sfl- 
37; defects in school -children, 45-46. 

Asymmetrical frequency-distributions,
90-102 ; relative positions of mean,
median and mode in, 121-122,
diagrams, 113-114. See also Frequency-distributions.

Asymmetry in frequency-distributions,
measures of, 107, 149-50.

Attributes, theory of, generally, 1-59 ;
def., 7 ; notation, 9-10, 11-16 ;
positive and negative, 10 ; order and
aggregate of classes, 10-11 ; ultimate
classes, 12 ; positive classes,
13-14 ; consistence of class-frequencies,
17-24 (see Consistence) ;
association of, 25-59 (see Association) ;
sampling of, 250-330 (see
Sampling, of attributes).
Averages, generally, 106-32 ; def.,
107 ; desirable properties of, 107-108 ;
forms of, 108 ; average in
sense of arithmetic mean, lOB ;
refs., 12S-130. See Mean, Median,
Mode.

Axes, principal, in correlation, 317-318.

Barlow, P., tables of squares, etc.,
67. Refs., 3G2.

Barometer heights, table, 96 ; diagram,
97 ; means, medians, and
modes, 122.

Bateman, H., refs., law of small
chances, 269.

Bateson, W., data cited from, 37.

Beeton, Miss H., data cited from, 78.

Bernoulli, J., refs., Ars Conjectandi,

Bertillon, J., ref., Cours élémentaire
de statistique, 36G.

Bertrand, J. L. F., refs., Calcul des
probabilités, 365.

Bias in sampling, 2S7-2ES, Zlh-211, 
332, 339, 348. 

Binomial series, 287-296 ; genesis of
in sampling of attributes, 287-289 ;
calculated series for different values
of p and n, 290, 291 ; experimental
illustrations of, 254-256, (qu. 1
and qu. 2) 270 ; graphic method
of forming a representation of series,
291-293 ; mechanical method of
forming a representation of series,
293-295, refs., 309 ; direct determination
of mean and standard-deviation,
296-298 ; deduction of
normal curve from, 297-298 ; refs.,
310.

Blakeman, J., refs., tests for linearity
of regression, 200 ; probable error
of contingency coefficient, 349.

Boole, G., refs., Laws of Thought,
23.

Booth, Charles, on pauperism, 193, 
195. 

Borel, E., refs., Théorie des probabilités,

Bortkewitsch, L. von, refs., law of
small chances, 269.

Bowley, A. L., refs., effect of errors
on an average, 360 ; on sampling,
349 ; Measurement of Groups and
Series, 349 ; Elements of Statistics,
356 ; Elementary Manual of Statistics,
S5G.

Bravais, A., refs., correlation, 188,
338.

British Association, data cited from,
stature, 88 ; weight, 96, see Stature,
Weight ; Reports on index-numbers,
refs., 130-131 ; Address by
A. L. Bowley on sampling, 349.

Brown, W., refs., effect of experimental
errors on the correlation-coefficient,
222.

Bruns, H., refs., Wahrscheinlichkeitsrechnung
und Kollektivmasslehre,
3SG.

Census (England and Wales), tabulation
of infirmities in, 14-15 ; data
as to infirmities cited from, 33-34 ;
classification of occupations, as
example of a heterogeneous classification,
72 ; data as to ages of
husbands and wives cited from, 159.

Chance, in sense of complex causation,
SO ; of success or failure of an

Chances, law of small, 261-262 ; refs.,
269.
Charlier, C. V. L., refs., theory of
frequency curves, resolution of a
compound normal curve, 310, 311.
Childbirth, deaths in, application of
theory of sampling, 278-280.
Class, in theory of attributes, 8 ;
class-symbol, 9 ; class-frequency,
10 ; positive and negative classes,
10 ; ultimate classes, 12 ; order of
a class, 10.
Classification, generally, 8 ; by dichotomy,
def., 9 ; manifold, 60-74, 76 ;
homogeneous and heterogeneous,



78, 80-81, 167, 164. 
Class-interval, def., 76 ; choice of
magnitude and position, 79-80 ;
desirability of equality of intervals,
76, 82-83 ; influence of magnitude
on mean, 113-114, 115, 116 ; on
standard-deviation, 140, 208.



Cloudiness at Breslau, frequency distribution,
103 ; diagram, 101.

Coefficient, of association, 37-38 ; of
contingency, 64-67 ; of variation,
149, standard error, 347 ; of correlation,
see Correlation.

Consistence, of class-frequencies for
attributes, generally, 17-24 ; def.,
13-]E> ; conditions, for one or two
attributes, 20 ; for three attributes,
21-22 ; refs., 23.

Consistence of correlation-coefficients,
246-247.

Contingency tables, def., 60 ; treatment
of, by elementary methods,
61-68 ; isotropy, 68-71, 324-327.

— — coefficient of, 64-67 ; application
to correlation tables, 167, (qu.
3) 189 ; standard error of (refs.),
349.

Contrary classes and frequencies (for
attributes), 10 ; case of equality of
contrary frequencies (qu. 6, 7, 8),
16 ; (qu. 8), 24 ; (qu. 7, 8, 9), 59.

Correction of death-rates, etc., for
age and sex-distribution, 219-221 ;
refs., 222.

— — of standard-deviation for grouping
of observations, 208 ; refs.
(including correction of moments
generally), 221.

— — of correlation-coefficient for
errors of observation, 209-210 ;
refs., 221-222.

Correlation, generally, 157-249 ; construction
of tables, 164 ; representation
of frequency-distribution by
surface, 165-167 ; treatment of
table by coefficient of contingency,
167 ; correlation-coefficient, 170-174,
def. 174, direct deduction
227-229 ; regressions, 175-177,
def. 176 ; standard-deviations of
arrays, 177, 204 ; calculation of
coefficient for ungrouped data, 177-181,
for a grouped table, 181-lSS ;
between movements of two variables,
197-201 ; elementary methods for
cases of non-linear regression, 201-202 ;
rough methods for estimating
coefficient, 202-205 ; correlation-ratio,
205 ; effect of errors of observation
on the coefficient, 209-210 ;
correlation between indices,
211-212 ; coefficient for a fourfold
table, direct, 212-213, on assumption
of normal correlation (Pearson's
coefficient) (refs.), 39, 329 ; for all
possible pairs of N values, 213-214 ;
correlation due to heterogeneity
of material, 214-215 ; effect
of adding uncorrelated pairs to a
given table, 215-216 ; application
to theory of weighted mean, 216-218 ;
correlation in theory of sampling,
267, 282-285, 338, 345-346 ;
standard error of coefficient,
348. Refs., 188, 205-206, 221-222.
For Illustrations, Normal, Partial,

Correlation, Illustrations and Examples,
correlation between :—

Two diameters of a shell (Pecten),
168 ; constants (qu. 3), 189.

Ages of husband and wife, 159 ;
diagram, 173 ; constants (qu. 3),
189.

Statures of father and son, 160 ;
diagrams facing 166, 174 ; constants
(qu. 3), 189 ; testing normality
of table, 318-324 ; diagram
of diagonal distribution, 321 ; of
contour-lines fitted with ellipses of
normal surface, 323.

Fertility of mother and daughter,
161, 195-196 ; diagram, 176 ; constants
(qu. 3), 189.

Discount rates and percentage of
reserves on deposits, 162 ; diagram,
facing 166.

Sex-ratio and numbers of births
in different districts, 163, 176 ;
diagram, 176 ; constants (qu. 3),
189 ; standard-deviations of arrays
and comparison with theory of
sampling, (qu. 7) 271 and (qu. 1)
285.

Earnings of agricultural labourers,
pauperism and out-relief, 177-181 ;
constants, (qu. 2) 189, 236 ;
treatment by partial correlation,
235-237 ; geometrical representation,
241-243.

Old-age pauperism and out-relief,
182-185.

Changes in pauperism, out-relief,
proportion of old and population,
192-196 ; partial correlation, 237-241.

Weather and crops, 196-197.

Movements of infantile and
general mortality, 197-199.

Movements of marriage-rate and
foreign trade, 199-201.
Correlation, normal, 313-330 ; deduction
of expression for two variables,
310-315 ; constancy of standard-deviation
of arrays and linearity
of regression, 315-316 ; contour
lines, 31S-317 ; normality of linear
functions of two normally distributed
variables, 317 ; principal
axes, 317-318 ; testing for normality
of correlation table for stature,
318-324 ; isotropy of normal correlation
table, 324-327 ; outline
of theory for any number of
variables, 327-328 ; coefficient for
a normal distribution grouped to
fourfold form round medians
(Sheppard's theorem), (qu. 4) 330 ;
applications to theory of qualitative
observations (refs.), 329. Refs.,
328-329.

— — partial, 226-249 ; the problem,
partial regressions and correlations,
226-227 ; notation and
definitions, 228-230 ; normal equations,
fundamental theorems on
product-sums, 230-231 ; significance
of generalised regressions
and correlations, 232 ; reduction
of standard-deviation, 232-233, of
regression 233-234, of correlation
234 ; arithmetical treatment 234-241 ;
representation by a model,
241-243 ; coefficient of n-fold correlation,
243-246 ; expression of
correlations and regressions in terms
of those of higher order, 246-240 ;
consistence of coefficients, 246-7 ;
fallacies, 247-248 ; limitations in
interpretation of the partial correlation
coefficient, partial association
and partial correlation, 243 ; partial
correlation in case of normal
distribution of frequency, 327-328.
Refs., 248-249, 328-329.

— — ratio, 206 ; refs., 206.

Cosin, values of estates in 1715,
100.

Cotsworth, H. B., refs., multiplication
table, 353.

Cournot, ... refs., theory of
probability, 366.

Crawford, G. E., refs., proof that
arithmetic mean exceeds geometric,
130.

Crelle, A. L., refs., multiplication
table, 353.

Crops and weather, correlation, 196-197.

Darbishire, A. D., data cited from,



, 261. 



, Uluati 



) Of 



correlation, II 

Darwin, Charles, data cited from, 
286-6. 

Datura, association between colour
and prickliness of fruit, 37, 38,
(qu. 10) 271.

Davenport, C. B., data as to Pecten
cited from, 168. Refs., statistical
tables, 353.

Deaf-mutism, association with imbecility,
33-34, 38 ; frequency
amongst offspring of deaf-mutes,
table, 104.

Deaths, death-rates, association with
sex, 32-33 ; with occupation (partial
correction for age-distribution),
52-53 ; in England and Wales,
1881-1890, table, 77 ; from diphtheria,
table, 98, diagram, 90 ; infantile
and general, correlation of
movements, 197-199 ; correction of,
for age and sex-distribution, 62-63,



accident, 261-262, deaths in child- 
birth, 278-280, deaths from ex- 
plosions in mines, 283-284 ; in- 
applicability of the theory of simple 
sampling, 266-267, 278-280,281- 



Deciles, 160-162 ;






r of. 



Defects : in school-children, association
of, 12, 45-46, refs., 16 ; census
tabulation of, 14-15.

De Morgan, A., refs., Formal Logic,
23 ; Theory of Probabilities, 366.

Deviation, mean, 134 ; generally,
144-147 ; def., 144 ; is least round
the median, 144-146 ; calculation
of, 145-146, (qu. 7) 155-156 ; comparison
of advantages with standard-deviation,
146-147 ; of normal
curve, 300.
— — quartile. See Quartiles.



Deviation, root-mean-square. See
Deviation, standard.

— — standard, 134-144 ; def. 134 ;
relation to root-mean-square deviation
from any origin, 131^1 3G ;
is the least possible root-mean-square
deviation, 13G ; little affected
by small errors in the mean, 136 ;
calculation for ungrouped data,
135-137, for a grouped distribution,
13S-141 ; influence of grouping,
140, 208 ; range of six times
the s.d. contains the bulk of the
observations, 140-142, 305 ; of a
series compounded of others, 142-143 ;
of N consecutive natural
numbers, 143 ; of a rectangle, 143 ;
of arrays in theory of correlation,
177, 204, 315-316 ; of generalised
deviations (arrays), 230, 232-233 ;
other names for, 144 ; of a sum or
difference, 207-208 ; effect of errors
of observation on, 209 ; of an index,
210-211 ; of binomial series, 295-296.
For standard-deviations of
see Error, standard.
., data cited from, 102. 

Dice, records of throwing, 254-256,
(qu. 1, 2, 3) 270 ; testing for
significance of divergence from
theory, 26S ; refs., 269.

Dickson, J. D. Hamilton, normal
correlation surface, 324. Refs.,
normal correlation, 329.

Diphtheria, ages at death from, table, 
98 ; diagram, B7. 

Discounts and reserves in American 
banks, table, 162 ; diagram, facing 
166. 

Dispersion, measures of, 107, 133-150 ;
unsuitability of range as
a measure, 133 ; relative, 149 ;
refs., 154. See Deviation, mean ;
Deviation, standard ; Quartiles.

Distribution of Frequency. See Frequency-distribution.



Earnings of agricultural labourers :
calculation of standard-deviation,
135-137 ; mean deviation, 145 ;
quartiles, 147 ; correlation with
pauperism and out-relief, 178-181,
constants, (qu. 2) 189, 235 ; diagram,
180 ; by partial correlation,
235-243 ; diagram of model, 242.






Edgeworth, F. Y., terms for measures
of dispersion, 144 ; dice-throwings
(Weldon), 254 ; probable error of
median, etc., 340. Refs., Index-numbers,
130-131 ; correlation,
188, 248, 329 ; law of error
(normal law), 269, 310 ; theory of
sampling, probable errors, etc.,
269, 319 ; dissection of normal
curve, 311.



Elderton, W. Palin, refs., calculation
of moments, 164 ; table of powers,
363 ; tables for testing fit, 349,
363 ; Frequency Curves and Correlation,
154, 366.

Error, law of ; errors, curve of. See
Normal curve.

— — mean, 144.

— — mean square, 144.

— — of mean square, 144.

— — probable, in sense of semi-interquartile
range, 147 ; in theory of
sampling, 306-307. For general
references, see Error, standard.

— — standard, def., 263 ; of number
or proportion of successes in n
events, 252-253, when numbers in
samples vary, 260-261, when chance
of success or failure is small, 261-262 ;
of percentiles (median, quartiles,
etc.), 333-337 ; of arithmetic
mean, 340-346 ; of standard-deviation
and coefficient of variation,
347 ; of coefficients of correlation
and regression, 348 ; refs., 269,
349-360. See also Sampling, theory
of.

— —, theory of : See Sampling,
theory of.

Estates, annual value of. See Value.

Everitt, P. F., refs., tables for calculating
Pearson's coefficient of
correlation for a fourfold table,
353.

Exclusive and inclusive notations for 
statistics of attributes, 1H6. 

Explosions in coal-mines, deaths from, 
as illustrating theory of sampling, 
284. 

Eye-colour, association between father
and son, 34-36, 38, 70-71 ; association
between grandparent, parent
and child, 46-48, 53-54 ; contingency
with hair-colour, 61, 68,
68-68 ; non-isotropy of contingency
table, for father and son, 70-71.




Falkner, R. P., refs., translation of
Meitzen's Theorie der Statistik, S.

Fallacies, in interpreting associations
— theorem on, 48-49, illustrations,
49-50 ; owing to changes of classification,
actual or virtual, 72 ; in
interpreting correlations — "spurious"
correlation between indices,
211-212 ; correlation due to heterogeneity
of material, 214-215 ; difference
of sign of total and partial
correlations, 247-248.

Fay, E. A., data cited from Marriages
of the Deaf in America, 104.

Fechner, G. T., refs., frequency-distributions,
averages, measures of
dispersion, etc., 129, 164 ; Kollektivmasslehre,
12S, 310, 36fl.

Fecundity of brood-mares, table, 96 ;
diagram, S4 ; mean, median, and
mode, (qu. 3) 131 ; inheritance
(ref.), 222.

161, 195-196 ; diagram,
176 ; constants, (qu. 3) 189 ;
ref., 222.

Filon, L. N. G., ref., probable errors,
3B0.

Fit of a theoretical to an actual frequency-distribution,
testing (ref.),
311 ; tables for, 363.

Fluctuation, measure of dispersion,
144.

Fountain, H., ref., index-numbers
of prices, 131.

Frequency of a class, 10, 76.

Frequency-curve, def., 87 ; ideal forms
of, 87-106 ; normal curve (q.v.),
297-309 ; refs., 106, 310.

Frequency-distributions, 76 ; formation
of, 79-83 ; graphic representation
of, 83-87 ; ideal forms —
symmetrical, 87-90, moderately
asymmetrical, 90-98, extremely
asymmetrical (J-shaped), 98-102,
U-shaped, 102-106 ; binomial series,
287-296 ; hypergeometrical series
(ref.), 285 ; normal curve, 297-309 ;
theoretical forms, refs., 285,
310. See Binomial series ; Normal
curve ; Correlation, normal.

— — illustrations : of death-rates in
England and Wales, 77 ; of ages at
death of certain women, 78 ; of
stigmatic rays on poppies, 78 ; of
annual values of dwelling-houses
in Great Britain, 83 ; of head-breadths
of Cambridge students,
84 ; of statures of males in the
U.K., 88, SO ; of pauperism in
different districts of England and
Wales, 93 ; of weights of males in
the U.K., 96 ; of fecundity of
brood-mares, 96 ; of barometer
heights at Southampton, 96 ; of
ages at death from diphtheria, 98 ;
of annual values of estates, 100 ;
of petals in Ranunculus bulbosus,
102 ; of degrees of cloudiness at
Breslau, 103 ; of percentages of
deaf-mutes in offspring of deaf-mutes,
104. See also Correlation,
illustrations and examples.

Frequency-polygon, construction of,
84.

Frequency-surface, forms and examples
of, 164-167 ; diagrams
1B6, facing 166 ; normal, diagram
1S6. See Correlation, normal.

Gabaglio, A., ref., Teoria generale
della statistica, 8.

Galton, Sir Francis, Hereditary
Genius, 3 ; frequency-distribution
of consumptivity, 104 ; grades and
percentiles, 150, 152 ; regression,
176 ; Galton's function (correlation
coefficient), 204 ; binomial machine,
296 ; normal correlation, 324 ;
data cited from, 34, 40, 70. Refs.
— geometric mean, 130 ; percentiles,
154 ; correlation, 188, 328 ; correlation
between indices, 222 ;
binomial machine, 309 ; Natural
Inheritance, 164, 3'""



, c. 



a of t 



, 144. Refs., normal curve, 
310 ; method of least squares, 366. 
Geiger, H., refs., law of small chances.



Grades, 152, 153.

Graphic method, of representing frequency
distributions, 83-87 ; of
interpolation for median or percentiles,
118, 151-152 ; of representing
correlation between two
variables, 180-181 ; of estimating
correlation coefficient, 203-204 ; of
forming one binomial polygon from
another, 291-293.

Graunt, John, Observations on the 
Bills of Mortality. B. 

Gray, John, data cited from, 266. 

Grouping of observations to form
frequency-distribution, choice of
class-interval, 79-80 ; influence on
mean, 113-114, 115, 116 ; influence
on standard-deviation, 140, 208.



Hair-colour : and eye-colour, example
of contingency, 61, 63, se-es ;
theory of sampling applied to



Harmonic mean. See Mean, harmonic.



Harris, J. A., refs., short method
of calculating coefficient of correlation,
206.

Head-breadths of Cambridge students,
table, 84 ; diagram, 85.

Helguero, F. de, refs., dissecting
compound normal curve, 311.

Heron, D., refs., relation between
fertility and social status, 205 ;
defective physique and intelligence,
application of correction
for age distribution, etc., 222 ;
abac giving probable errors of
correlation coefficient, 349, 353 ;
probable error of a partial correlation
coefficient, 350.

Histogram, construction of, 84.

Hollis, T., cited re Cosin's Names of
the Roman Catholics, etc., 100.

Hooker, R. H., correlation between
weather and crops, 196 ; between
movements of two variables, 201.
Refs., correlation between movements
of two variables, 206 ;
weather and crops, 205, 249 ;
theory of partial correlation, 248.

Houses, inhabited and uninhabited,
in rural and urban districts, 61-62 ;
annual value of, table, 83 ;
median, (qu. 4) 131 ; quartiles,
(qu. 3) 165.

Hull, C. H., refs., The Economic
Writings of Sir William Petty,
together with the Observations on
the Bills of Mortality more probably
by Captain John Graunt, 6.

Husbands and wives, correlation between
ages, table, 159 ; diagram,
173 ; constants, (qu. 3) 189.

Hypergeometrical Series, ref., 285. 

Illusory associations, 48-51.
Imbecility, association with deaf-mutism,
33-34, 38.
Inclusive and exclusive notations for



butes, 26-28 ; case of complete, for
attributes, 56-57 ; form of contingency
or correlation table in case
of, 71.

Index-numbers of prices, def., 126 ; 
use of geometric mean for, 126-127 ; 
use of harmonic mean, 129 ; refs., 
130-131. 

Indices, correlation between, 210-212 ;
refs., 222.

Infirmities, census tabulation of, 14-15 ;
association between deaf-mutism
and imbecility, 33-34, 38.

Intermediate observations, in a
frequency-distribution, classification
of, 80-81 ; in correlation table,
164.

Isotropy, def., 68 ; generally, 67-71 ;
of normal correlation table, 324-327 ;
refs., 73.

Jevons, W. Stanley, use of geometric
mean, 127. Refs., system
of numerically definite reasoning
(theory of attributes), 15 ; index-numbers,
130 ; Pure Logic and
other Minor Works, 15 ; Investigations
in Currency and Finance,
130.

John, V., refs., Geschichte der Statistik,
5.

J-shaped frequency-distributions, 98-102.

Kapteyn, J. C., refs., Skew Frequency-curves
in Biology and Statistics,
130, 310.

Kick of a horse, deaths from, following
law of small chances, 261-262.




Laplace, Pierre Simon, Marquis de,
probable error of median, 340.
Refs., normal curve, 310 ; mean
deviation least about the median,
1 G4 ; Théorie analytique des probabilités,
164, 350, SGS.

Larmor, Sir J., use of word "statistical,"
4.

Lee, Alice, data cited from, 96, 122,
160, 161. Refs., inheritance of
fertility and fecundity, 222.

Lemna Minor, correlation between
lengths of mother- and daughter-frond,
1SG-I87.

Lexis, W., use of term "precision,"
144. Refs., Theorie der Massenerscheinungen,
2S9 ; Abhandlungen
zur Theorie der Bevölkerungs- und
Moralstatistik, 289, 366.

Lipps, G. F., refs., measures of
dependence (association, correlation,
contingency, etc.), 39 ; Fechner's
Kollektivmasslehre, 129, 356.

Little, W., data as to agricultural
labourers' earnings cited from, 137.

Lobelia, application of theory of
sampling to certain data, 265-266.



Macalister, Sir Donald, ref., law
of geometric mean, 130, 310.

Macdonell, W. R., data cited from,
84, BO.

Marriage-rate and trade, correlation
of movements, 199-201.

Maxwell, Clerk, use of word "stat-

Mean, arithmetic, generally, 108-116 ;
def., 108-109 ; nature of, 109 ; calculation
of, for a grouped distribution,
109-113 ; influence of grouping,
113-114, 115, 116 ; position
relatively to mode and median, 121-122,
diagrams, 113, 114 ; sum of
deviations from, is zero, 114 ; of
series compounded of others, 115 ; of
sum or difference, 115-116 ; comparison
with median, 119 ; summary
comparison with median and
mode, mean is the best for all
general purposes, 122-123 ; weighting
of, 216-221 ; of binomial series,
296 ; standard error of, 340-346.

— — deviation. See Deviation, mean.



Mean, error, 144.

— — geometric, generally,
123-128 ; def., 123 ; calculation,
124 ; less than arithmetic mean,
123 ; difference from arithmetic
mean in terms of dispersion, (qu. 8)
156 ; of series compounded of
others, 124 ; of series of ratios or
products, 124 ; in estimating intercensal
populations, 125-126 ; convenience
for index-numbers, 126-127 ;
use on ground that deviations
vary with absolute magnitude, 127-128 ;
weighting of, 221.

— — harmonic, 108 ; generally, 128-129 ;
def., 128 ; calculation, 128 ;
is less than arithmetic and geometric
means, 129 ; difference from
arithmetic mean in terms of dispersion,
(qu. 9) 156 ; use in averaging
prices or index-numbers, 129 ; in
theory of sampling, when numbers
in samples vary, 260-261.

— — square error, 144.

— — weighted, 216-221 ; def., 218 ;
difference between weighted and
unweighted means, 217-219 ; application
of weighting to correction
of death-rates, etc., for age and sex-distribution,
219-221 ; refs., 222.

Median, 108 ; generally, 116-120 ;
def., 116 ; indeterminate in certain
cases, 116-117 ; unsuited to discontinuous
observations, 117 ; calculation
of, 117 ; graphical determination
of, 118 ; comparison with
arithmetic mean, 119 ; advantages
in special cases, 118-120 ; slight
influence of outlying values on,
120 ; position relatively to mean
and mode, 121-122, diagrams, 113,
114 ; weighting of, 221 ; standard
error of, 333-337.

Meitzen, P. A., refs., Geschichte,
Theorie und Technik der Statistik,



Mice, numbers in litters, harmonic
mean, 128 ; proportions of albinos
in litters, fluctuations compared
with theory of sampling, 260-261.

Milton, John, use of word "statist," 1.



Mode, 108 ; generally, 120-123 ; def.,
120 ; approximate determination,
from mean and median, 121-122 ;
diagrams showing position relatively
to mean and median, 113,
114 ; logarithmic or geometric mode,
128 ; weighting of, 221 ; refs., 130.

Modulus, as measure of dispersion,
144 ; origin from normal curve,
300.

Mohl, Robert von, refs., Geschichte
und Literatur der Staatswissenschaften,
G.

Moment, first, def., 110 ; second and
general, def., 135 ; calculation of
moments (ref.), 164.

Moore, L. Bramley, data cited from,
96, 161. Ref., inheritance of fertility
and fecundity, 222.

Mortality. See Death-rates.

Movements, correlation of, in two
variables, methods, 197-201 ; refs.,



Negative classes and attributes, 10.

Newsholme, A., refs., birth rates, correction
for age-distribution, etc.,
222 ; Vital Statistics, 355.

Normal curve of errors : deduction
from binomial series, 297-298 ;
value of central ordinate, 300 ;
table of ordinates, 299 ; mean
deviation and modulus, 300 ;
comparison with binomial series
for moderate value of n, 300-301 ;
outline of more general methods
of deduction, 301-303 ; fitting to
a given distribution, 303-304 ; the
table of areas, 306, and its use,
306-306 ; quartile deviation and
probable error, 306-307 ; numerical
examples of use of tables, 307-309 ;
normality in fluctuations of sampling
of the mean, 342-343. Refs.,
general, 310 ; dissection of compound
curve, 310-311 ; tables,
353-354. For normal correlation,
see Correlation, normal.

Norton, J. P., data cited from, 162.
Ref., Statistical Studies in the New
York Money Market, 20B.

Order, of a class, 10 ; of generalised
correlations, regressions, deviations,
and standard-deviations,




Palgrave, Sir R. H. I., Dictionary
of Political Economy, 6.
Pareto, V., refs., Cours d'économie
politique, 106.
Partial association. See Association,
partial.
Partial correlation. See Correlation,
partial.
Pauperism, in England and Wales,
table 93 ; diagrams, 92, 113 ; calculation
of mean, 111 ; of median,
117, 118 ; means, medians, and
modes for other years, 122 ; standard-deviation,
139-140 ; mean
deviation, 145-146 ; quartiles,
148 ; percentiles, 151-152.

— — correlation with out-relief, 182-186 ;
with earnings and out-relief,
177-181, (qu. 2) 189 ; with out-relief,
proportion of aged, etc.,
192-196.

Pearl, Raymond, normal distribution
of number of seeds in Lolvt, 302.
Ref., probable errors, 360.

Pearson, Karl, contingency, 63, 06 ;
mode, 120 ; standard-deviation,
144 ; coefficient of variation, 149 ;
skewness, 149 ; inheritance of
fertility, 195 ; spurious correlation
between indices, 212 ; binomial
apparatus, 295 ; deduction of
normal curve, 302 ; data cited from,
70, 78, 90, 96, 122, 160, 161.
Refs., correlation of characters
not quantitatively measurable, 39,
329 ; contingency, etc., 72-73, 329 ;
frequency curves, 106, 130, 164,
269, 286, 310, 360 ; binomial
distribution and machine, 269 ;
hypergeometrical series, 296 ; dissection
of compound normal curve,
310-311 ; calculation of moments,
221 ; general methods of curve-fitting,
205, 206 ; testing fit of
theoretical to actual distribution,
311 ; correlation, 188, 205-206,
248, 329 ; fitting of principal axes
and planes, 329 ; correlation between
indices, 222 ; inheritance of
fertility, 222 ; weighted mean, reproductive
selection, 222 ; probable
errors, 349, 860.

Peas, applications of theory of 
sampling to experiments in cross- 
ing, 263-264. 

Pecten, correlation between two
diameters of shell, 168 ; constants,
(qu. 3) 189.

Percentage, standard error of, 252-253 ;
when numbers in samples
vary, 260-261. See also Sampling,
of attributes.

Percentiles, 150-153 ; def., 150 ; determination,
151-152 ; advantages
and disadvantages, 152-153 ; use
for unmeasured characters, 152-153,
refs., 329 ; standard errors
of, 333-337 ; correlation between
errors of sampling in, 337-338 ;
refs., 164.

Petals of Ranunculus bulbosus, frequency
of, 102 ; unsuitability of
median in case of such a distribution,
117.

Peters, J., refs., multiplication table,
353.

Petty, Economic Writings, 6.
Poincaré, H., refs., Calcul des probabilités,
366.
Poisson, S. D., refs., sex-ratio, 269 ;
Recherches sur la probabilité des
jugements, 2U9, 366.
Poppies, stigmatic rays on, frequency,
78 ; unsuitability of median in
such a distribution, 116.
Population, estimation of between
censuses, 125-126 ; refs., 130.
Positive classes and attributes, def.,
10 ; number of positive classes, 13 ;
sufficiency of for tabulation, 13 ;
expression of other frequencies, in
terms of, 13-14.
Precision, 144, 253, 300.
Prices, index-numbers of, 126 ; use of
geometric mean, 126 ; of harmonic
mean, 129 ; refs., 130-131.
Principal axes, in correlation, 317- 

318 ; ref., 329. 

Quartile deviation. See Quartiles.

Quartiles, quartile deviation and semi-interquartile
range, 134 ; generally,
147-146 ; defs., 147 ; determination,
147-148 ; ratio of q.d. to
standard-deviation, 146, 306 ; advantages
of q.d. as a measure of
dispersion, 148-149 ; difference between
deviations of quartiles from
median as measure of skewness,
150 ; ratio of q.d. to median as
measure of relative dispersion, 149 ;



,356. 



q.d. of normal curve, 306 ; standard
errors, 333-337, 337-3
Quetelet, L. A. J., refs., Lett
la théorie des probabilités, 2(



Random sampling, in sense of simple
sampling, 285.

Range, unsuitability of, as a measure
of dispersion, 133.

Ranks, 143, 153 ; methods of correlation
based on (refs.), 329.

Ranunculus, frequency of petals, 102 ;
unsuitability of median for such
distributions, 117.

Registrar-General : correction of death-rates,
220, refs., 222 ; estimates of
population, refs., 130 ; data cited
from Reports, 32-33, 52-53, 77,
98, 163, 197-199, 199-201, 218,
258, 279.

Regressions, generally, 175-177 ; def.,
175 ; total and partial, 229 ; standard
errors of, 348 ; non-linear,
201-202, refs., 206.

Relative dispersion, 149.

Reserves and discounts in American
banks, correlation, 162 ; diagram,
facing 166.

Rhind, A., ref., tables for computing
probable errors, 350, 353.

Rutherford, E., ref., law of small
chances, 269.

Sampling, theory of, generally, 250- 
S61 ; the problem, 250-262 ; refs., 
269, 285, 309-311, 349-360, 

of attributes : conditionsossumed 

in simple sampling, 261-252, 256- 
253 ; random in sense of simple 
sampling, E85 ; standai-d deviation 
of number or proportion of successes 
in n events, 252-253, 296-296; 
examples from artificial chance. 



vary, 260-261 ; when chance of 
success or failure is small, 261-262; 
standard error def., 263; compar- 
ing a sample with theory, 263-264 ; 
comparing one sample with another 
independent therefrom, 264-267 ; 
comparing one sample with another 
combined with it, 267-268; limita- 
tions to interpretation of standard 
error when n is small, inverse in- 
terpretation, 272-276 : lUf^ ^ a 



measure of nntniatworthines*, 276- 
277 ; effect of removing coaditianB 
of simple samplino, 277-285 ; sam- 
pling from UniiCsd material, 283 ; 
binomial distribntioD, 2S7-296 ; 
normal curve, 297-309 ; normal 
correlatioD, 313-330. Set also 
Binomial series ; Hypergeometrical 
series i Normal curve; Coirelation, 
normal. 

Sampling, of variables, conditions 
assumed in simple sampling, 331- 
333; standard errors of percentiles 
(median and quartiles), 333-337 ; 
dependence of standard error of 
median on the form of the distribu- 
tion, 334-336; of difference between 
two percentiles, 337-339 ; of arith- 
metic mean, 340-346 ; of difference 
between two means, 341-342 ; nor- 
mality of distribution of mean, 
342-343 ; effect of removing con- 
ditions of simple sampling on 
standard error of mean, 343-346 ; 
standard error of standard-devia- 
tion and coefficient of variation, 
347 ; of coefficients of correlation 
and regression, 348. 

Saunders, Miss E. R., data cited 
from, 37. 



Scheibner, W., difference between 
arithmetic and geometric, arith- 
metic and harmonic means, (qu. 8, 
9) 166. 




Scripture, E. W., use of word 



Semi-interquartile range. See Quar- 
tiles. 

Sex-ratio of births : correlation with 
total births, 163, 17B ; diagram, 
176; constants, (qu. 3) 189; 
application of the theory of samp- 
ling to, 258-260, (qu. 7) 271, (qu. 
1, 2) 286, refs., 269; standard 
error of ratio male to female births, 
(qu. 11) 27J 

Shakespeare, W., use of word 

Sheppard, W. F., correction of the 
standard-deviation for grouping, 
208, 303 ; theorem on correlation 
of a normal distribution grouped 
round medians, (qu. 4) 330 ; 
normal curve tables, 333 ; standard 
errors of percentiles, 340. Refs., 
calculation and correction of 
moments, 221 ; normal curve and 
correlation, theory of sampling, 
310, 329, 350; tables of normal 
function and its integral, 3G4. 

Significant differences, 262. 

Sinclair, Sir John, use of words 
"statistics," "statistical," 2. 

Skew or asymmetrical frequency- 
distributions, 90-102. See also 
Frequency-distributions. 

Skewness of frequency-distributions, 
107 ; measures of, 149-160. 

Southey, Robert, cited re Cosin's 
Names of the Roman Catholics, 
etc., 100. 

Spearman, C., effect of errors of 
observation on the standard-devia- 
tion and coefficient of correlation, 
210. Refs., effect of errors, 221 ; 
rank method of correlation, 329. 

Standard-deviation. See Deviation, 
standard. 

Statist, occurrence of the word in 
Shakespeare and in Milton, 1. 

Statistical, introduction and develop- 
ment in the meaning of the word, 
1-6 ; S. Account of Scotland, 2 ; 
Royal S. Society, 3 ; methods, pur- 
port of, 3-5, def., 6. 

Statistics, introduction and develop- 
ment in meaning of word, 1-6 ; 
def., 5 ; theory of, def., 5. 

Statures of males in U.K., tables, 88, 
90 ; diagrams, 89, 91 ; calculation 
of mean, 112; means and medians, 
117, (qu. 1) 131; standard-devia- 
tion, 141 ; percentiles, 163; stan- 
dard-deviation, mean deviation and 
quartiles, (qu. 1) 165 ; distribution 
fitted to normal curve, 301-302, 
303-304, diagram, 302 ; standard 
errors of mean and median, of first 
and ninth deciles, 337, 338, 340- 
341, of standard-deviation and 
semi-interquartile range, (qu. 6) 
361. 

correlation of, for father and 

son, 160; diagrams, facing 166, 
174 ; constants, (qu. S) 189 ; test- 
ing for normality, 318-324; dia- 
gram of diagonal distribution, 321, 
of fitted contour lines, 323. 

Stevenson, T. H. C., refs., birth- 
rates, correction of, for age-dis- 
tribution, 222. 






Stigmatic rays on poppies, frequency, 
73 ; unsuitability of median for 
such distributions, IIS. 

Stirling, James, expression for fac- 
torials of large numbers, 300. 

"Student" (pseudonym), refs., law 
of small chances, 2S8 ; probable 
errors, 8G0. 

Symmetrical frequency-distributions, 
87-90. See also Frequency-dis- 
tributions ; Normal curve. 

Symons, G. J., use of word "sta- 
tistics" in British Rainfall, 8. 

Tabulation, of statistics of attri- 
butes, 11-15, 87 ; of a frequency- 
distribution, SI ; of a correlation 
table, 1S4. 

Tatham, John, refs., correction of 
death-rates, 222. 

Thorndike, E. L., refs., methods of 
measuring correlation, 329 ; Theory 
of Mental and Social Measurements, 
3S6. 

Todhunter, I., refs., History of 
the Mathematical Theory of Prob- 
ability, 6. 

Type of array, def., 164. 

Ultimate classes and frequencies, 
def., 12 ; sufficiency of, for tabula- 
tion, 12-13. 

Universe, def., 17 ; specification of, 
17, 18, 

U-shaped frequency distributions, 
102-105. 

Value, annual, of dwelling-houses, 
table, 83 ; median, (qu. 4) 131 ; 
quartiles, (qu. 8) 165. 

of estates, in 1715, table, 100 ; 

diagram, 101. 

Variables, theory of, generally, 76- 
249; def.,7, 7E. 

Variates, def., 150. 

Variation, coefficient of, 149; stan- 
dard error of, 347. 

Venn, John, refs., Logic of Chance, 



Verschaeffelt, G., relative dispersion, 
149. Refs., measure of relative dis- 
persion, I Si. 

Vigor, H. D., data cited from, 163. 
Refs., sex-ratio, 266. 

Wages of agricultural labourers, see 
Earnings. 

Warner, F., refs., study of defects in 
school-children, notation for stat- 
istics of attributes, IB. 

Waters, A. C., refs., estimating in- 
tercensal populations, 130. 

Weather and crops, correlation, 196- 
197. 

Weighted Mean, see Mean, weighted; 
also Mean, geometric; Median ; 
Mode. 

Weights of males in U.K., table, 95 ; 
diagram, 94 ; mean, median, and 
mode, (qu. 2) 131; standard-devia- 
tion, mean deviation and quartiles, 
(qu. 2) 155. 

Weldon, W. F. R., dice-throwing 
experiments, 254-255. 

Westergaard, H., refs., Theorie der 
Statistik, 8, 269, 366. 

Yule, G. U., use of term character- 
istic lines (lines of regression), 177 ; 
problem of pauperism, 1S2 ; data 
cited from, 78, 93, 122, 140, 163, 
186, 265. Refs., history of words, 
"statistics," "statistical," 6; at- 
tributes, association, consistence, 
etc., 16, 23, 39, 57 ; isotropy, influ- 
ence of bias in statistics of qualities, 
7S; correlation, 188, 222, 248; cor- 
relation between indices, 222 ; fre- 
quency-curves, 310; probable 
errors, 350 ; pauperism, 130, 205, 
249; birth-rates, 206, 222 ; sex- 
ratio, 269. 



H., multiplication table, 363. 

Zizek, F., refs., Die statistischen Mit- 
telwerthe, 129. 



